The MAILING DATE of this communication appears on the cover sheet with the correspondence address
Period for Reply 

A SHORTENED STATUTORY PERIOD FOR REPLY IS SET TO EXPIRE 3 MONTH(S) FROM 
THE MAILING DATE OF THIS COMMUNICATION. 

- Extensions of time may be available under the provisions of 37 CFR 1.136(a). In no event, however, may a reply be timely filed
after SIX (6) MONTHS from the mailing date of this communication. 

- If the period for reply specified above is less than thirty (30) days, a reply within the statutory minimum of thirty (30) days will be considered timely. 

- If NO period for reply is specified above, the maximum statutory period will apply and will expire SIX (6) MONTHS from the mailing date of this communication. 

- Failure to reply within the set or extended period for reply will, by statute, cause the application to become ABANDONED (35 U.S.C. § 133). 
Any reply received by the Office later than three months after the mailing date of this communication, even if timely filed, may reduce any 
earned patent term adjustment. See 37 CFR 1.704(b).

Status 

1) [X] Responsive to communication(s) filed on 10 September 2001.
2a) [ ] This action is FINAL. 2b) [X] This action is non-final.

3) [ ] Since this application is in condition for allowance except for formal matters, prosecution as to the merits is
closed in accordance with the practice under Ex parte Quayle, 1935 C.D. 11, 453 O.G. 213.

Disposition of Claims 

4) [X] Claim(s) 1-26 and 28-31 is/are pending in the application.

4a) Of the above claim(s) 27 and 32 is/are withdrawn from consideration.

5) [ ] Claim(s) is/are allowed.

6) [X] Claim(s) 1-5, 7-13, 15, 18-20, 24-26 and 28-31 is/are rejected.

7) [X] Claim(s) 6, 14, 16, 17, 21-23 is/are objected to.

8) [ ] Claim(s) are subject to restriction and/or election requirement.

Application Papers 

9) [X] The specification is objected to by the Examiner.

10) [X] The drawing(s) filed on 10 September 2001 is/are: a) [ ] accepted or b) [X] objected to by the Examiner.

Applicant may not request that any objection to the drawing(s) be held in abeyance. See 37 CFR 1.85(a).

Replacement drawing sheet(s) including the correction is required if the drawing(s) is objected to. See 37 CFR 1.121(d).
11) [X] The oath or declaration is objected to by the Examiner. Note the attached Office Action or form PTO-152.

Priority under 35 U.S.C. § 119 

12) [X] Acknowledgment is made of a claim for foreign priority under 35 U.S.C. § 119(a)-(d) or (f).
a) [X] All b) [ ] Some * c) [ ] None of:

1. Certified copies of the priority documents have been received.

2. [ ] Certified copies of the priority documents have been received in Application No. .

3. [ ] Copies of the certified copies of the priority documents have been received in this National Stage
application from the International Bureau (PCT Rule 17.2(a)). 
* See the attached detailed Office action for a list of the certified copies not received. 



Attachment(s) 

1) [X] Notice of References Cited (PTO-892)
2) [ ] Notice of Draftsperson's Patent Drawing Review (PTO-948)
3) [X] Information Disclosure Statement(s) (PTO-1449 or PTO/SB/08) Paper No(s)/Mail Date 9/10/2001.
4) [ ] Interview Summary (PTO-413) Paper No(s)/Mail Date. .
5) [ ] Notice of Informal Patent Application (PTO-152)
6) [ ] Other: .



U.S. Patent and Trademark Office 
PTOL-326 (Rev. 1-04) 
DETAILED ACTION

Claims 1-26 and 28-31 have been examined. 
In a Preliminary Amendment 

Claims 1 - 26 and 28-31 were amended. 

Claims 27 and 32 were cancelled. 

Priority 

1. Receipt is acknowledged of papers submitted under 35 U.S.C. 119(a)-(d), which papers
have been placed of record in the file. 

Information Disclosure Statement 

2. The Information Disclosure Statement (IDS) filed December 6, 2001 has been 
considered. The reference in French could not be considered. 

Oath/Declaration 

3. Applicant has elected to use an outdated version of 37 CFR 1.56 "(as amended effective 
March 16, 1992)". Applicant should use the current form on the USPTO.GOV website when 
submitting a new Declaration. 

Drawings 

4. New corrected drawings in compliance with 37 CFR 1.121(d) are required in this
application because they are fuzzy and hard to read and will not display properly in a U.S. 
Patent. Applicant is advised to employ the services of a competent patent draftsperson outside 
the Office, as the U.S. Patent and Trademark Office no longer prepares new drawings. The 
corrected drawings are required in reply to the Office action to avoid abandonment of the 
application. The requirement for corrected drawings will not be held in abeyance. 

5. Figure 1 should be designated by a legend such as --Prior Art-- because only that which
is old is illustrated. See MPEP § 608.02(g). Corrected drawings in compliance with 37 CFR 
1.121(d) are required in reply to the Office action to avoid abandonment of the application. The 
replacement sheet(s) should be labeled "Replacement Sheet" in the page header (as per 37 CFR 
1.84(c)) so as not to obstruct any portion of the drawing figures. If the changes are not accepted 
by the examiner, the applicant will be notified and informed of any required corrective action in 
the next Office action. The objection to the drawings will not be held in abeyance. 

Specification 

6. The abstract of the disclosure is objected to because it must be on a separate page.
Correction is required. See MPEP § 608.01(b). 

7. Preliminary amendment of September 10, 2001 has been entered. 

8. The disclosure is objected to because of the following informalities: The spelling of
several words is not in the format for United States English. The European spelling of the
following words must be changed:

European Spelling United States English Spelling 

"analysing" analyzing 
"analysed" analyzed 
"reinitialised" reinitialized 
"reinitialisation" reinitialization 
Correction will benefit the searching of U.S. Patent literature. 
Appropriate correction is required. 

9. Page 5 of the Specification contains the acronym "FIBs" without the term being fully
spelled out. One common meaning is "Secured hash standard, Federal Information Processing
Standards Publication (FIBS) 180-1, May 1994". Clarification is required, with a corresponding
change to the Specification.

10. The use of the trademark "JAVA" has been noted in this application. It should be 
capitalized wherever it appears and be accompanied by the generic terminology. 

Although the use of trademarks is permissible in patent applications, the proprietary 
nature of the marks should be respected and every effort made to prevent their use in any manner
which might adversely affect their validity as trademarks. 

11. The title of the invention is not descriptive. A new title is required that is clearly
indicative of the invention to which the claims are directed. 

Claim Rejections - 35 USC § 112 

12. The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the 
subject matter which the applicant regards as his invention. 

13. Claims 8-10 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite for
failing to particularly point out and distinctly claim the subject matter which applicant regards as
the invention. The problem is that the Applicant states the program to be monitored (DATA). The
focus of the claim language should be the functionality of the monitor program and how it handles
the various conditions presented by the input as it is processed. The Specification clearly supports
what the Applicant is attempting to claim. This claim as written is indefinite. Dependent claims
are also rejected merely because they are dependent on claim 8.

Claim 8

A method according to Claim 1 wherein, when the program to be monitored provides for at least 
one jump, the monitoring method is applied separately to sets of instructions in the program 
which do not include jumps between two successive instructions. 

Claim Rejections - 35 USC §102 

14. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the 
basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(b) the invention was patented or described in a printed publication in this or a foreign country or in public use or on 
sale in this country, more than one year prior to the date of application for patent in the United States. 

15. Claims 1-5, 7-13, 15, 18-20, 24-26, 28, 30 and 31 are rejected under 35 U.S.C. 102(b) as
being anticipated by USPN 4,266,272 to Berglund et al. (IDS).

The environment of the invention, JAVACARD, is not claimed but is vastly different from the
environment of the IDS reference.

Claim Interpretation

The control circuitry in the IDS reference is performing the monitor function of the claimed
invention.

Claim 1 

IDS anticipates a method for monitoring progress with the execution of a linear sequence of 
instructions in a computer program (IDS, Abstract, control circuitry ), comprising the steps of 
analysing the sequence of instructions transmitted to a processor intended to execute the program 
being monitored by extracting a data item from each instruction transmitted to the processor 
(IDS, Abstract, check word ) and performing a calculation on said data item (IDS, Abstract, 
dynamically calculated ), and verifying, the result of this analysis by comparing the result of said 
calculation to reference data (IDS, Abstract, local storage register vs. ALU ), recorded with said 
program, wherein the reference data comprises a value pre-established so as to correspond to the 
result of the analysis produced during the monitoring method only if all the instructions in the 
sequence of instructions have actually been analysed during the running of the program (IDS, 
Abstract, control storage ). 
Claim Interpretation 

The limitation "of a linear sequence of instructions" is not given patentable weight because it is 
dependent on the form of the input. Not part of the invention. It is treated as data. 

Claim 2

A method according to Claim 1, wherein the verification of the result of the analysis is caused by 
an instruction placed at a predetermined location in the program to be monitored ( as per claim 1 
a register is a predetermined location), said instruction containing the reference data relating to a 
set of instructions whose correct execution is to be monitored (registers are inherently related to 
the instruction being processed). 

Claim 3 

A method according to Claim 1 wherein, when the instructions of the set of instructions to be 
monitored are in the form of a value, said analysis of the instructions is carried out by using these 
instructions as a numerical value. (Interpretation - all values are in binary format - this is 
inherent). 

Claim 4 

A method according to Claim 1, comprising the steps of: 

- during the preparation of the program to be monitored ( as per claim 1): 

- incorporating, in at least one predetermined location in a sequence of instructions (as per claim 
1) in the program, a reference value established according to a predetermined rule applied to 
identifiable data in each instruction to be monitored ( as per claim 1, identification of words), 
and during the execution of the program to be monitored ( as per claim 1): 

- obtaining said identifiable data in each instruction received for execution (IDS, fetch col 9, 10- 
30), 

- applying said predetermined rule to said identifiable data thus obtained in order to establish a 
verification value ( as per claim 1), and 

- verifying that this verification value actually corresponds to the reference value recorded with 
the program ( as per claim 1). 

Claim 5 

A method according to Claim 1, further comprising a step of interrupting the flow of the program 
if the analysis reveals that the program being monitored has not been run as expected. (IDS, 
Figure #4, Result ERROR from result branch). 

Claim 7 

A method according to Claim 1 wherein the set of instructions to be monitored does not include 
jumps in its expected flow. 
Claim Interpretation 

The limitation "set of instructions to be monitored does not include jumps in its expected flow" 
is not given patentable weight because it is dependent on the form of the input. Not part of the 
invention. It is treated as data. 

Claim 8 

A method according to Claim 1 wherein, when the program to be monitored provides for at least 
one jump, the monitoring method is applied separately to sets of instructions in the program 
which do not include jumps between two successive instructions ( IDS, col 9, lines 10 - 40). 
Claim Interpretation

The limitation "program to be monitored provides for at least one jump" is not given patentable 
weight because it is dependent on the form of the input. Not part of the invention. It is treated as 
data. 

Claim 9 

A method according to Claim 8, wherein, when the program to be monitored includes an 
instruction for a jump dependent on the manipulated data, the monitoring method is implemented 
separately for a set of instructions which precedes the jump, and for at least one set of 
instructions which follows said jump. As per claim 8. 

Claim 10 

A method according to Claim 9, wherein, for a set of instructions providing for a jump, an 
instruction which controls this jump is integrated in said set of instructions for the purpose of 
obtaining a verification value for this set of instructions before executing the jump instruction,
as per claim 8. 

Claim 11 

A method according to Claim 1 wherein the analysis is reinitialised before each new monitoring 
of a sequence of instructions to be monitored. (IDS, cycle and incrementer, col 3, lines 40 - 60) 

Claim 12 

A method according to Claim 11, wherein the reinitialisation of the analysis of each new 
monitoring includes the step of erasing or replacing a verification value obtained during a 
previous analysis. As per claim 11, depending on cycle determination.

Claim 13 

A method according to Claim 11 wherein the reinitialisation of the monitoring analysis is
controlled by the software itself. (Interpretation - the control circuitry and software being 
executed has a functional relationship - This is deemed inherent and related to Examiner's note 
above) 

Claim 15 

A method according to Claim 1 wherein the analysis includes the step of calculating, for each 
instruction under consideration following a previous instruction, the result of an operation on 
both a value obtained of the instruction in question and the result obtained by the same operation 
performed on the previous instruction. As per claim 1. 

Claim 18 

A method according to Claim 1 wherein the analysis includes the step of obtaining a comparison 
value by calculating successive intermediate values as the data of the respective instructions are 
obtained. (IDS, Abstract, last sentence words is plural). 



Claim 19 

A method according to Claim 1 wherein the analysis comprises a step of saving each data item
necessary for verification, obtained from instructions in the set of instructions to be monitored as 
they are executed, and performing a calculation of a verification value from these data only at the 
necessary time, once all the necessary data have been obtained. ( as per claim 1 and details of 
fetch col 3, lines 10-30). 

Claim 20 

IDS anticipates a device for monitoring progress with the execution of a series of instructions of 
a computer program, comprising means for analysing the sequence of instructions transmitted to 
the processor intended to execute the program being monitored by extracting a data item from 
each instruction transmitted to the processor and performing a calculation on said data item, and 
means for verifying the result of this analysis by comparing the result of said calculation to 
reference data recorded with said program, wherein the reference data comprises a value pre- 
established so as to correspond to the result of the analysis produced during monitoring only if 
all the instructions in the sequence of instructions have actually been analysed during the running
of the program. As per claim 1. 
Claim Interpretation 

The limitation "a series of instructions of a computer program" is not given patentable weight 
because it is dependent on the form of the input. Not part of the invention. It is treated as data. 

Claim 24 

A device according to Claim 20 that is integrated into a programmed device containing said 
program to be monitored. (IDS, Abstract, Control Circuitry). 

Claim 25 

A device according to Claim 20 that is integrated into a program execution device. (IDS, 
Abstract, Control Circuitry). 

Claim 26 

IDS anticipates a program execution device that executes a series of instructions of a computer 
program, comprising means for analysing the sequence of instructions transmitted for execution 
by extracting a data item from each instruction and performing a calculation on said data item, 
and means for verifying the result of this analysis by comparing the result of said calculation to 
reference data recorded with the program to be monitored, wherein the reference data comprises 
a value pre-established so as to correspond to the result of the analysis produced during 
monitoring only if all the instructions in the sequence of instructions have actually been analysed 
during the running of the program. As per claim 1.
Claim Interpretation 

A. The limitation "a series of instructions of a computer program" is not given patentable weight 
because it is dependent on the form of the input. Not part of the invention. It is treated as data. 

B. In a similar fashion, the limitation "correspond to the result of the analysis produced during
monitoring only if all the instructions in the sequence of instructions have actually been analysed
during the running of the program" can be dependent on the input. If the program is only a few
statements, all of which are to execute, the claim limitations are input dependent. The
claim limitations, outside the fact that a monitor function is present, have not performed a
non-required step to distinguish it from a monitor.

Claim 28 

IDS anticipates a programmed device containing a series of recorded instructions and a fixed 
memory containing reference data pre-established as a function of data contained in said 
instructions for analysis and verification of the sequence of instructions, wherein the reference 
data comprises a value pre-established so as to correspond to the result of the analysis produced 
during monitoring only if all the instructions in the sequence of instructions have actually been 
analyzed during the running of the program, as per claim 1. 
Claim Interpretation 

A. The limitation "a series of recorded instructions" is not given patentable weight because it is 
dependent on the form of the input. Not part of the invention. It is treated as data. 

Claim 30 

A device according to Claim 28 wherein the reference data are recorded in the form of a 
prewired value or values fixed in memory. (IDS, Abstract, last sentence).
Claim Interpretation 

Given the presence of the OR in the limitation, the Examiner elects to reject the underlined
limitation above.

Claim 31 

A device for programming a programmed device according to Claim 28, comprising means for 
entering, in at least one predetermined location in a sequence of instructions in the program, a 
reference value calculated according to a preestablished mode from data included in each 
instruction in a set of instructions whose execution is to be monitored. As per claim 1.

Claim Rejections - 35 USC § 103 

16. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

17. Claim 29 is rejected under 35 U.S.C. 103(a) as being unpatentable over IDS in view of 
USPN# 6,402,028. 



Claim 29 

IDS teaches a device but IDS does not teach the device is a smart card, according to Claim 28,
wherein said device is a smart card. USPN 6,402,028 teaches the production of Smart Cards 
where the logic is on the card, therefore, it would have been obvious to one of ordinary skill in 
the art at the time of invention, to combine IDS with 6,402,028 because logic control for Smart 
cards makes Smart Cards more reliable. 



Allowable Subject Matter 
18. Claims 6, 14, 16, 17 and 21 - 23 are objected to as being dependent upon a rejected base 
claim, but would be allowable if rewritten in independent form including all of the limitations of 
the base claim and any intervening claims. The bold and underlined limitations below indicate 
limitations not found in the prior art of record.

Claim 6 

A method according to Claim 1, further comprising an invalidation step for future use of the 
device comprising the monitored program if said analysis reveals a predetermined number of 
times that the program being monitored has not run in the expected manner.

Claim 14 

A method according to Claim 1 wherein the analysis produces a verification value obtained as 
the last value in a series of values which is made to change successively with the analysis of 
each of the analysed instructions of the set of instructions, thus making it possible to 
contain an internal state of the running of the monitoring method and to follow its changes.

Claim 16 

A method according to Claim 1 wherein the analysis includes the step of recursively applying a 
hash function to values obtained of each monitored instruction, starting from a last 
initialisation performed. 

Claim 17 

A method according to Claim 1 wherein the analysis includes the step of making a verification 
value change by performing a redundancy calculation on all the operating codes and the 
addresses executed since the last initialisation was carried out. 



Claim 21 

A device according to Claim 20, further including a register for recording intermediate results in 
a calculation in a chain carried out by the analysis means in order to obtain a verification 
value. 



Application/Control Number: 09/936,174 
Art Unit: 2124 



Page 11 



Claim 22 

A device according to Claim 21, further comprising means for recording a predetermined value 
or resetting the register under the control of an instruction transmitted during the execution of a 
program to be monitored. (Dependent on claim 21) 

Claim 23 

A device according to Claim 20, further comprising means for counting the number of 
unexpected events in the program being monitored, as determined by the analysis means, and 
means for invalidating the future use of the program to be monitored if this number 
reaches a predetermined threshold. 

Conclusion 

19. The prior art made of record and not relied upon is considered pertinent to applicant's 

disclosure. 

US Patent Literature 

A. 6,402,028 - Deals with mass production of Smart Cards Column 4 covers JAVACARD 
technology. 

B. 6,668,325 - Employs an obfuscation technique on a section of code. Environment is 
distributed. 

C. 5,974,549 - Monitor is implemented via Dynamic Link Library (DLL). 

D. 6,546,546 - Appears to be dependent on the extensible operating system disclosed 
(PARAMECIUM). 

E. 6,092,120 - Focus on class loaders. 

F. 6,327,700 - Based on Profile data. 

G. 6,557,168 - Monitor is included at class level not at low level as per disclosed invention. 




H. 6,275,938 - The monitor environment runs at operating system level not processor level as 
disclosed invention. 



20. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Todd Ingberg whose telephone number is (571) 272-3723. The 
examiner can normally be reached during the work week.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the 
organization where this application or proceeding is assigned is 703-872-9306. 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).



Todd Ingberg
Primary Examiner 
Art Unit 2124 

Chapter 1

Building the First 
Java Examples 



Welcome to Java 2! An ambitious agenda lies before
you: You're going to get a firm grip on Java programming,
creating both powerful Java programs and Web
pages, and you will take a guided tour through Java 2. There is
no more exciting programming package available. As you are
probably aware, the popularity of Java has skyrocketed as more
and more people have seen how versatile and powerful it is. Web
programmers have found it an excellent tool because it allows
them to write programs that will run on many different types of
computers. They have started using it to make their Web pages
actually do something.



Adapted from Java 2: In Record Time by Steven Holzner 
ISBN 0-7821-2171-3 560 pages $29.99 






With Java, you will be able to display animation and images, accept 
mouse clicks and text, use controls like scrollbars and check boxes, print 
graphics, support pop-up menus, and even support additional windows 
and menu bars. 

We'll start working on your Java skills right away; you won't need
to wade through chapters of abstractions first. We will concentrate on
examples, on seeing things from the programmer's point of view, on
seeing Java at work.

Java programs come two ways: as stand-alone applications and as 
small programs you can embed in Web pages, called applets. Of the two, 
applets are the most popular, and we'll concentrate primarily on them. 

Building the Hello Example 

The first example will be a simple one because right now we just want to 
get you started in Java without too many extra details to weigh you down. 
You will create a small Java applet, the type of Java program you can
embed in a Web page, that will display the words "Hello from Java!" 



What's an Applet?

An applet is a small program you can embed in a Web page such that the applet gains control over a
certain part of the Web page. On that part of the page, the applet can [...]

Each applet is given the amount of space (usually measured in pixels)
that it requests in a Web page, such as the amount of space shown in Figure 1.1.
(Soon I'll show you how an applet "requests" space.) This is the
space that the applet will use for its display. We'll place the words "Hello
from Java!" in the applet, as shown in Figure 1.2.



FIGURE 1.1: An applet requests space in a Web page.



In general, the name of the file will match exactly (including case) the
name given in the "class" statement in the file; in this case, that is
hello:

import java.awt.Graphics;

public class hello extends java.applet.Applet
{
    // Called whenever the applet's area of the page needs to be drawn.
    public void paint( Graphics g )
    {
        g.drawString("Hello from Java!", 60, 30 );
    }
}

In this book, you will place your programs into subdirectories of a
new directory called java1-2 ([...] name). That means you'll save the
hello.java file as c:\java1-2\hello\hello.java.

Now you have created hello.java. This is the source code for your
applet, and it contains the Java code that you have written. The next
step is to compile this Java code into a working applet and see your
applet at work. Applets have the extension .class, and the name
of your actual applet will be hello.class. I'll show you why applets have the
extension .class shortly.

Now you'll use Java itself to create your applet.

Java Development Kit (JDK) 1.2

With previous versions of Java, you used to have to go through a rather
lengthy and involved installation process, but that's all changed now;
you just have to run an .EXE file. You get this .EXE file online from
http://java.sun.com/products [...] and
follow the instructions for installation.

The next step is to make sure you can run the JDK from any location
in your computer (including the c:\java1-2 directory and its subdirectories,
which is where you'll put your Java programs). To do that, make
sure the PATH statement in your AUTOEXEC.BAT file (found in the main
directory of the C: drive) includes the JDK BIN and LIB directories (here
I have installed the JDK in c:\jdk12; use whatever path is appropriate
to the way you have installed the JDK):

PATH=C:\WINDOWS;C:\JDK12\BIN;C:\JDK12\LIB

The JDK 1.2 is ready to go.




[...] (your unzipping program must be able to handle long filenames).

You'll need a Web browser to look at the Java documentation because it's
formatted in HTML. [...] a look at this material [...]

From Java 1.0 to Java 1.1

Abstract Windowing Toolkit enhancements Java 1.1 [...]



JAR files JAR (Java Archive) files were introduced in Java
1.1 and let you package a number of files together, zipping them
to shrink them, so the user can download many files at once.
You can put many applets and the data they need together into
one .jar file, making downloading much faster. These files are
analogous to .zip files except that your browser will download
them and unzip them on-the-fly for you.
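
As a rough illustration of the packaging idea, the sketch below bundles two files into one archive in memory and then lists what the archive holds. It is written for a current JDK using the java.util.jar classes, not for the JDK 1.2 tools installed in this chapter; the class name JarSketch, its two helper methods, and the entry names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical helper class: packs named text entries into a JAR held in memory.
class JarSketch {
    // Writes one compressed entry per name and returns the archive bytes.
    static byte[] pack(String[] names, String[] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                jar.putNextEntry(new JarEntry(names[i]));
                jar.write(contents[i].getBytes(StandardCharsets.UTF_8));
                jar.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the archive back and returns the entry names in stored order.
    static List<String> list(byte[] jarBytes) {
        List<String> names = new ArrayList<>();
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        byte[] jar = pack(new String[] {"hello.class", "data.txt"},
                          new String[] {"placeholder bytes", "some applet data"});
        System.out.println(list(jar));
    }
}
```

Everything a page needs travels in the single byte array that pack returns, which is the point of the format: one download instead of many.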
Internationalization Java 1.1 lets you develop locale-specific
applets, including using Unicode characters, a locale mechanism,
localized message support, locale-sensitive date, time,
time zone, number handling, and more.
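
To make locale-sensitive number handling concrete, here is a minimal sketch that formats the same value for two locales. The class name LocaleDemo and its format helper are invented for the example; NumberFormat and Locale are the standard java.text and java.util classes.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class: formats one number according to a given locale.
class LocaleDemo {
    static String format(double value, Locale locale) {
        // The formatter picks the grouping and decimal separators for the locale.
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        // The two lines differ only in the separators each locale uses.
        System.out.println(format(value, Locale.US));
        System.out.println(format(value, Locale.GERMANY));
    }
}
```

The program itself never mentions a comma or a period; the separators come entirely from the locale passed in.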
Signed applets and digital signatures Java 1.1 can create 
digitally signed Java applications. A digital signature gives your 
users a "path" back to you in case something goes wrong. This 
is one of the new security precautions popular on the World 
Wide Web. 

Remote method invocation In Java 1.1, RMI lets Java
objects have their methods invoked from Java code running in
other Java sessions. This is sort of similar to Local Remote Procedure
Calls (LRPCs).

Object serialization Serialization lets you [...]
out streams. Besides allowing you to store copies of the objects
you serialize, serialization is also the basis of communication
between objects [...] Microsoft's Foundation Classes.
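
Storing a copy of an object through a stream can be sketched as follows. This is only an illustration under assumptions: the class SerializationSketch, its nested Point class, and the two helper methods are invented here, and the code targets a current JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example: writes an object to a stream and reads a copy back.
class SerializationSketch {
    // Implementing Serializable is what allows instances to be written out.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Serializes the object into a byte array.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds an object from bytes produced by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = (Point) fromBytes(toBytes(new Point(3, 4)));
        System.out.println(copy.x + ", " + copy.y);
    }
}
```

The copy is a distinct object with the same field values, which is why the same mechanism can carry objects between Java sessions.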

Reflection InJava 1.1, reflection lets Ja^ code e^e infor- 
mation about the method and constructors of loaded classes 

o and make use <i^«P^J^^.^^^f^y'- 
Inner classes  Java 1.1 makes it easier to create adapter
classes. An adapter class is a class that implements an interface
required by an API (Applications Programming Interface). An
adapter class "delegates" control back to an enclosing main
object.

New Java native method interface  Native code is code that
is written specifically for a particular machine. In Java 1.1, this
interface was introduced to provide a standard programming
interface for writing Java native methods. The primary goal is
binary compatibility of native method libraries across all Java
virtual machine implementations on a given platform. [...]
new Java native method [...]

Byte, Short, and Void classes  In Java 1.1, Byte and Short
values can be handled as "wrapped" numbers when you use the
new Java classes Byte and Short. The new Void class is a [...]

Deprecated methods  Quite a number of Java 1.0 methods are
deprecated in the Java 1.1 documentation. (The Java compiler [...]

[...] in the java.net [...] selected BSD-style socket options [...]

I/O enhancements  In Java 1.1, the I/O package was extended
with character streams, which are like byte streams except that
they contain 16-bit Unicode characters [...]
independent of a specific character encoding [...]

From Java 1.1 to Java 2 

Now let's have a look at what's new in Java 2. 

Security enhancements  When code is loaded, it is assigned
permissions based on the security policy currently in effect.
Each permission specifies a permitted access to a particular
resource (such as "read" and "write" access to a specified file or
directory, "connect" access to a given host and port, and so on).
The policy, specifying which permissions are available for code
from various signers/locations, can be initialized from an external
configurable policy file. Unless a permission is explicitly
granted to code, it cannot access the resource that is guarded
by that permission.
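The rule that a permission must be explicitly granted can be illustrated with the java.io.FilePermission class, which models "read" and "write" access to a file or directory. The sketch below only compares permissions with implies(); it does not install a policy, and the Security Manager that enforced such policies has been deprecated in later Java releases. The class name PermissionSketch and the /data paths are hypothetical.

```java
import java.io.FilePermission;

// Hypothetical helper class, for illustration only.
class PermissionSketch {
    // Returns true if a grant of grantedActions on grantedPath
    // covers the requested access to wantedPath.
    static boolean covers(String grantedPath, String grantedActions,
                          String wantedPath, String wantedActions) {
        FilePermission granted = new FilePermission(grantedPath, grantedActions);
        FilePermission wanted = new FilePermission(wantedPath, wantedActions);
        return granted.implies(wanted);
    }
}
```

In a path such as /data/- the trailing "-" stands for every file below the directory, recursively, so one grant can guard a whole tree while still refusing any action that was not listed.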

Swing (JFC)  Swing is the part of the Java Foundation Classes
(JFC) that implements a new set of GUI components with a
"pluggable" look and feel. Swing is implemented in pure Java,
and is based on the JDK 1.1 Lightweight UI Framework. The
pluggable look and feel lets you design a single set of GUI
components that can automatically have the look and feel of any
platform (e.g., Windows, Solaris, Macintosh).
Java 2D (JFC)  The Java 2D API is a set of classes for
advanced 2D graphics and imaging. It encompasses line art,
text, and images in a single comprehensive model.
Accessibility (JFC)  Through the Java Accessibility API,
developers will be able to create Java applications that can
interact with assistive technologies such as screen readers, [...]

Drag and Drop (JFC)  Drag and Drop enables data transfer
across both Java and native applications, between Java
applications, and within a single Java application.

Collections  [...]
for representing and manipulating Java collections [...]
you more about them later), allowing them to be manipulated
independent of the details of their representation.
Java extensions framework  Extensions are packages of Java
classes (and any associated native code) that application
developers can use to extend the core platform. The extension
mechanism allows the Java Virtual Machine (JVM) to use the extension
classes in much the same way it uses the system classes.
JavaBeans enhancements  Java 2 provides developers with
standard means to create more sophisticated JavaBeans components
and applications that offer their customers more seamless
integration with the rest of their runtime environment,
such as the desktop of the underlying operating system or the [...]

Input method framework  The input method framework
enables all text-editing components to receive Japanese, Chinese,
or Korean text input through standard input methods.

Package version identification  "Versioning" introduces
package-level version control where applications and applets
can identify (at runtime) the version of a specific Java Runtime
Environment, VM, and class package.

RMI enhancements  Remote Method Invocation (RMI) has [...]
which include support for remote objects and automatic
object activation, as well as Custom Socket Types, which allow a
remote object to specify the custom socket type [...]
such as SSL, can be supported using custom socket types.)

[...] API that allows the serialized data of an object to be specified
independently of the fields of the class. This allows [...]

[...] a program to [...] an object [...]
that does not prevent the object from being reclaimed [...]

Audio enhancements  Audio enhancements include a new
sound engine and support for audio [...]

[...] distributed Web-enabled Java applications to invoke operations
transparently on remote network services using the industry-standard
OMG IDL (Object Management Group Interface Definition
Language) and IIOP (Internet Inter-ORB Protocol)
defined by the Object Management Group.

JAR enhancements The enhancements include added func- 
tionality for the command-line JAR tool for creating and updat- 
ing signed JAR files. There are also new standard APIs for 
reading and writing JAR files. 

JNI enhancements The Java Native Interface (JNI) is a stan- 
dard programming interface for writing Java native methods 
and embedding the Java Virtual Machine into native applica- 
tions. The primary goal is binary compatibility of native method 
libraries across all Java Virtual Machine implementations on a 
given platform. Java 2 extends the Java Native Interface to incor- 
porate new features in the Java platform. 

JVMDI  A new debugger interface: the Java Virtual Machine
now provides low-level services for debugging. The interface for
these services is the Java Virtual Machine Debugger Interface
(JVMDI).

JDBC enhancements  Java Database Connectivity (JDBC) is
a standard SQL database access interface, providing uniform
access to a wide range of relational databases. JDBC also provides
a common base on which higher-level tools and interfaces
can be built. The Java 2 software bundle includes JDBC and the
JDBC-ODBC bridge.

Compiling the Hello Applet

Now that you have installed the JDK and have your hello.java source
file ready to go, you can create the actual applet and see it run. To do this,
change to the c:\java1-2\hello directory now (or wherever you have
saved the hello.java file); this is how the DOS prompt should look:
c:\java1-2\hello>

Next, type this to create your applet:
c:\java1-2\hello>javac hello.java
[...] that makes up your applet with JVM's class loader and then runs the
applet.

Running the Hello Applet

To see hello.class, your first applet, running, you'll need a Web page
to place it in. Use your editor again to create a new file, hello.htm,
which will be your Web page, written in the language of Web pages,
Hypertext Markup Language (HTML) (we'll review HTML in a minute).
Enter the following text into hello.htm and save it in the same directory
as the hello.class file:



<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>
<body>
<hr>
<applet
code=hello.class
width=200
height=200>
</applet>
<hr>
</body>
</html>

Now you can run the hello applet by simply viewing this new Web
page, hello.htm. To do that, use the Applet Viewer that comes with the
JDK 1.2. To use the Applet Viewer, go back to the hello subdirectory
and type the following:

c:\java1-2\hello>appletviewer hello.htm

Again, capitalization is very important here; make sure your capitalization
matches the exact spelling of the Web page name. When you've done
this, the Applet Viewer runs, as shown below, and you see your message,
"Hello from Java!" Your first applet is a success.




TIP 

You can use any Java-enabled Web browser to look at this Web page. For most 
of the applets in this book, however, you will have to use either a Web browser 
that supports Java 2 (not just Java 1.0 or Java 1.1) or the Sun Applet Viewer. 

Your first applet, hello.class, runs, but what exactly did you do?
Let's take a look now at the Java code that you entered for hello.java,
examining it line by line to get a better idea of how Java programming
works (even though Java will handle [...] for you later).

Understanding the Hello Example

Let's take apart your first [...]
import java.awt.Graphics;
What does this mean? This line actually points out one of the great
advantages of Java programming. When you're adding menus and separate
windows to your Java applets, you can imagine that it would be a
great deal of work to create everything from scratch, that is, to write the
entire code for menu handling, separate window creation, and so forth.
Instead of asking you to do so, Java comes complete with several predefined
libraries, and much of this book will be an examination of the routines
in these libraries. You'll learn more about this later, but what you're
doing is adding support from the main Java graphics library of routines
to your applet. In this way, we'll be able to draw the text string, "Hello
from Java!", in the applet's window.

If you are a C/C++ programmer, you'll notice that the import statement works
much like the C/C++ #include statement.

Next, add these lines to hello.java:
import java.awt.Graphics;

public class hello extends java.applet.Applet

You've just [...]

Object-Oriented Programming

Objects and classes are two fundamental concepts in object-oriented
programming (OOP) [...] make the whole topic seem mysterious [...]
to make longer programs [...]

Understanding Java Objects

In long, involved programs, there can be a profusion of both variables
[...] programming was invented to break up such large programs.
The idea behind objects is [...]
performing a discrete task, and those are your objects. For example,
you may put all the screen-handling parts of a program together into [...]

What's a Java Class?

But how do you create objects? That's where classes come in. [...]
int the_data;

For example, if you had set up a class named, say, graphicsClass,
you can create an object of that class named screen this way:
graphicsClass screen;

You'll see how to actually create a class soon (creating a class like [...]
you'll see how to create objects of that class. What's important to remember
is this: the object itself is what holds the data you want to work with;
the class itself holds no data but just describes how the object should be
set up.

Object-oriented programming at root is nothing more than a way of
grouping functions and the data they work on together to make your
program less cluttered. You'll see more about object-oriented programming
throughout this book, including how to create a class, how to create an
object of that class, and how to reach the functions and data in that object
when you want to.

That completes the mini-overview of classes and objects. As you can
see, a class is just a programming construct that groups together, or
encapsulates, functions and data, and an object may be thought of as
a variable of that class's type, as the object screen is to the class
graphicsClass.
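The distinction between a class and its objects can be sketched in a few lines. The class below is a hypothetical stand-in for the graphicsClass mentioned above (the name GraphicsClassSketch and its methods are assumptions made for illustration). Note that in Java a declaration names only a reference; the object itself comes into being when you use the new keyword.

```java
// Hypothetical class: it describes the data each object will hold
// and the functions that work on that data.
class GraphicsClassSketch {
    private int theData;   // each object gets its own copy of this field

    GraphicsClassSketch(int initialValue) {
        theData = initialValue;
    }

    int getData() {
        return theData;
    }

    void setData(int value) {
        theData = value;
    }
}
```

You might create two objects with new GraphicsClassSketch(1) and new GraphicsClassSketch(2); changing the data in one leaves the other untouched, because the data lives in the objects and not in the class.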

As it turns out, Java comes complete with several libraries of prede- 
fined classes, which save you a great deal of work. Throughout this book, 
we will examine these predefined and very useful Java classes. Using 
these predefined classes, we'll create objects needed to handle buttons, 
text fields, scroll bars, and much more. 



Learning about Java Packages 



These class libraries are called packages in Java, and one such library is
called java.awt (where awt stands for Abstract Window Toolkit). This
library holds the Graphics class, which will handle the graphics work
you undertake. So this line in the hello.java file:

import java.awt.Graphics;
actually means that you want to include the Java Graphics class and
make use of it in your program. In a minute, you will use an object of the
Graphics class for your graphics output.

You've added support for graphics handling by including the
java.awt.Graphics class (and in Java, displaying the text string "Hello
from Java!" is considered graphics handling). Next, it's time to set up
your hello applet itself. To do so, define a new class named hello.
This is the standard way of setting up an applet in Java, and in fact, the
applet itself has the file extension .class. That's because each class
defined in a .java file ends up being exported to a .class file, where
you can make use of it. You'll learn more details about this soon.
It would be quite difficult to write all the code an applet class needs
from scratch. For example, we'd need to interact with the Web browser,
reserve a section of screen, initialize the appropriate Java packages, and
much more. It turns out that all that functionality is already built into the
Java Applet class, which is part of the java.applet package. But how [...]

Understanding Java Inheritance

You can customize the java.applet.Applet class by deriving the
[...] the java.applet.Applet class. This makes
[...] Applet the base class of the hello class, and it
[...] hello a class derived from java.applet.Applet. This gives
you all the power of the java.applet.Applet class without the
worries of writing it yourself, and you can add what you want to this
class by adding code to your derived class hello.

[...] important part of o[...]
[...] For example, you may have a base class
called chassis. You [...]
[...] Although the car and truck classes
share the same base class, chassis, they added different
items to the base class, ending up as two quite different classes, car and
truck.

[...] the base class java.applet.Applet [...]
derived [...] indicate that the hello class is
derived from the java.applet.Applet class like this (note that you
use the keyword class to indicate that you are defining a new class):
import java.awt.Graphics;

public class hello extends java.applet.Applet
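The chassis, car, and truck example from this section can be sketched directly in Java. The class names follow the text, but the methods wheels() and describe() are assumptions added purely for illustration.

```java
// Base class: holds what every derived vehicle has in common.
class Chassis {
    int wheels() {
        return 4;
    }

    String describe() {
        return "chassis with " + wheels() + " wheels";
    }
}

// Car reuses everything in Chassis and adds its own description.
class Car extends Chassis {
    @Override
    String describe() {
        return "car on a " + super.describe();
    }
}

// Truck also derives from Chassis but customizes the wheel count.
class Truck extends Chassis {
    @Override
    int wheels() {
        return 6;
    }

    @Override
    String describe() {
        return "truck on a " + super.describe();
    }
}
```

Both derived classes get the code in Chassis without rewriting it, just as hello gets the code in java.applet.Applet, and each adds or overrides only what makes it different.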
[...]mands). But how do you [...] hello class? How
do you display your text string? [...] object-oriented pro[...]
paint() method like this:
import java.awt.Graphics;

public class hello extends java.applet.Applet
{
public void paint( Graphics g )
{
[...] parts are called a class's members.

What Are Java Access Modifiers?

[...] access modifier. A class's methods can be
declared public, private, or protected. If they are declared public,
[...] If they are declared private, they may be called from
[...] defined. If protected, they may
[...] they are defined and the classes
derived from that class.
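Here is a minimal sketch of the three access modifiers at work. The classes AccountSketch and BonusAccountSketch are hypothetical and exist only to show which calls the compiler accepts. One Java-specific detail worth knowing is that protected members are also reachable from other classes in the same package.

```java
// Hypothetical base class showing private, protected, and public members.
class AccountSketch {
    private int balance = 100;              // reachable only inside AccountSketch

    protected void adjust(int delta) {      // reachable from derived classes
        balance += delta;
    }

    public int balance() {                  // reachable from anywhere
        return balance;
    }
}

// Derived class: may call the protected method but not touch the private field.
class BonusAccountSketch extends AccountSketch {
    public void addBonus() {
        adjust(10);                         // allowed: adjust() is protected
        // balance += 10;                   // would not compile: balance is private
    }
}
```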

Next, indicate the return type of the paint() method. When you call
a method, you can pass parameters to it, and it can return data to you.
In this case, paint() has no return value, which you indicate with the
return type void. Other return types are int for an integer return value
(in Java, this is always 32 bits long), long for a long integer (always
64 bits long), float for a floating point return, or double
for a double-precision floating point return. You can also return arrays and [...]
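The return types just listed can be tried out in a short sketch. The class and method names here are hypothetical; the constants Integer.SIZE and Long.SIZE come from the standard library and report the width of each type in bits.

```java
// Hypothetical class with one method per return type discussed above.
class ReturnTypesSketch {
    static int intBits() {          // int return value
        return Integer.SIZE;
    }

    static int longBits() {
        return Long.SIZE;
    }

    static long beyondInt() {       // long holds values too large for an int
        return Integer.MAX_VALUE + 1L;
    }

    static double half() {          // double-precision floating point return
        return 1 / 2.0;
    }

    static int[] firstSquares(int count) {   // methods can return arrays too
        int[] squares = new int[count];
        for (int i = 0; i < count; i++) {
            squares[i] = (i + 1) * (i + 1);
        }
        return squares;
    }

    static void doNothing() {       // void: no return value at all
    }
}
```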

Finally, note that you indicate that the paint() method is automatically
passed one parameter, an object of the Graphics class called g:
import java.awt.Graphics;

public class hello extends java.applet.Applet
{
public void paint(Graphics g)

This Graphics object represents the physical display of the applet.
That is, you can use the built-in methods of this object, such as
drawImage(), drawLine(), drawOval(), and others, to draw on the
screen. In this case, you want to place the string "Hello from Java!" on
the screen, and you can do that with the drawString() method.

How do you reach the methods of an object like the Graphics object
named g? You do that with a dot operator (.) like this: g.drawString(),
where here you are invoking g's drawString() method to "draw" a
string of text on the screen (text is handled like any other type of graphics
[...] printed, just as you would draw a rectangle or circle). Supply three
parameters to the drawString() method: the string you want to
display, and the (x, y) location of that string's lower-left corner (called the
starting point of the string's baseline) in pixels on the screen, passed in
as integer values. As shown in Figure 1.3, you can draw your string at the
pixel location (60, 30), where (0, 0) is the upper-left corner of the applet's
display.




The coordinate system in a Java program is set up with the origin (0, 0) at the
[...] downwards; this fact will be important throughout the book. If this seems
backwards to you, you might try thinking of it in terms of reading a page of text, like
this one [...]
screen pixels.

FIGURE 1.3: Drawing a string at (60,30). The origin (0,0) is at the
upper left; x increases to the right and y increases downward.

This means that you add a call to the drawString() method this way:

import java.awt.Graphics;

public class hello extends java.applet.Applet
{
    public void paint( Graphics g )
    {
        g.drawString( "Hello from Java!", 60, 30 );
    }
}

Note that Java uses the same convention as C or C++ to indicate that a
code statement is finished: it ends the statement with a semicolon (;).




TIP 



In general, Java adheres very strongly to C++ coding conventions. If you know
C++, you already know a great deal of Java.
You have completed the code necessary for this applet, which is also
to say you have completed the code for the new class, hello. When the
Java compiler creates hello.class, the entire specification of the new
class will be in that file. This is the actual binary file that you upload to
your Internet Service Provider so that it may be included in your Web
page. A Java-enabled Web browser takes this class specification and creates
an object of that class and then gives it control to display itself and,
if applicable, handle user input.

But how? You have not yet completed the dissection of the first example;
all you have done so far is to trace the development of hello.java into
hello.class. How did you get the applet to be displayed in the Applet
Viewer?

Understanding the Applet's Web Page 

The Applet Viewer took the hello.class applet and displayed it in a
Web page, as shown in Figure 1.4. 




FIGURE 1.4: Displaying an applet in a Web page 

How did it get there? You created a Web page for your applet and then
opened that Web page in the Applet Viewer, which then displayed your
applet. That Web page looks like this:
<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>
<body>
<hr>
<applet
code=hello.class
width=200
height=200>
</applet>
<hr>
</body>
</html>

Web pages are written in HTML (Hypertext Markup Language).
Because applets appear in Web pages, we will take the time to briefly
work through the above page to make sure you know what's going on. If
you're familiar with HTML, you can skip much of this review, but you
should take a look at how to use the <applet> tag to embed applets in
Web pages.

Connecting Java and HTML 

Let's take apart the Web page you created for the applet now, starting
with the <html> tag:
<html>

Instructions in .html pages are placed into tags surrounded by angle
brackets: < and >. The tags hold directions to the Web browser and are
not displayed on the screen. Here, the <html> tag indicates to the Web
browser that this .html file is written in HTML.

Next comes a comment. Comments in .html pages are written using
the ! symbol like this: <! This is a comment.>. Indicate that this is
a Web page written so that we can use the Sun Applet Viewer, like this:

<html>
<!- Web page written for the Sun Applet Viewer>
[...] end header tag </head>. [...]
such as <head> and [...] (for example, <center> and </center> to center
text and images). [...]

<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>

[...] a ruler line (visible in Figure 1.4), [...] <hr> [...]

<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>
<body>
<hr>

[...] supported by the hello.class file [...]

<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>
<body>
<hr>
<applet
code=hello.class
width=200
height=200>
</applet>



TIP 

You can also use the java.applet.Applet.resize() method in your source
code to request that the Web browser resize applets.

The <applet> tag is important, so let's take a closer look at it now.
Here's how the <applet> tag works in general (the items in square
brackets are optional, and the others are required):

<APPLET
[ALIGN = LEFT or RIGHT or TOP or TEXTTOP or MIDDLE or
ABSMIDDLE or BASELINE or BOTTOM or ABSBOTTOM]
[ALT = AlternateText]
CODE = AppletName.class
[CODEBASE = URL of .class file]
HEIGHT = AppletPixelsHeight
[HSPACE = PixelSpaceToLeftOfApplet]
[NAME = AppletInstanceName]
[VSPACE = PixelSpaceAboveApplet]
WIDTH = AppletPixelsWidth
>
[<PARAM NAME = Parameter1 VALUE = VALUE1>]
[<PARAM NAME = Parameter2 VALUE = VALUE2>]

</APPLET>
TIP

This is often useful if you want to store your applets together in a directory in
your ISP, away from the .html files.

Indicate to the Web browser here how much space you'll need for your
applet, using the HEIGHT and WIDTH keywords. You can also pass parameters
to applets with the PARAM keyword like this: <applet code=hello.class
width=200 height=200><param name=today value="friday"></applet>.
Passing parameters this way
allows you to customize your applets to fit different Web pages because
you can read the parameters from inside an applet and make use of them.







TIP 

There are enhancements to the <applet> tag in Java 2, such as the ability to
pass the name of .jar files as parameters. You'll learn more about this later on.

[...] support Java. In practice, this means that those
browsers just ignore the <applet> tag. This, in turn, means that you can
place text between the <applet> and </applet> tags that will be displayed
in non-Java browsers (and not in Java-enabled browsers), like this:
<applet code=hello>
[...] does not support Java, so you [...]
</applet>

Using the <applet> tag, you can embed applets in Web pages, as
Java has done in this temporary page. Finish off the Web page with the
</body> and </html> tags as follows:
<html>
<!- Web page written for the Sun Applet Viewer>
<head>
<title>hello</title>
</head>
<body>
<hr>
<applet
code=hello.class
width=200
height=200>
</applet>
<hr>
</body>
</html>

This completes our first example; you've had a glimpse into the
process of creating and running an applet. It was as quick and easy as
that: you created and ran your first applet.

What's Next? 

In this chapter, the example applet demonstrated the easiest way to get 
an applet to work. Let's continue on to get a better idea of how you'll be 
working with Java throughout the book as you give your applet more 
power in Chapter 2. 
Abstract

This paper describes a new public-key cryptosystem based 
on the hardness of computing higher residues modulo a com- 
posite RSA integer. We introduce two versions of our scheme, 
one deterministic and the other probabilistic. The determin- 
istic version is practically oriented: encryption amounts to 
a single exponentiation w.r.t. a modulus with at least 768 
bits and a 160-bit exponent. Decryption can be suitably op- 
timized so as to become less demanding than a couple RSA 
decryptions. Although slower than RSA, the new scheme is 
still reasonably competitive and has several specific appli- 
cations. The probabilistic version exhibits an homomorphic 
encryption scheme whose expansion rate is much better than 
previously proposed such systems. Furthermore, it has se- 
mantic security, relative to the hardness of computing higher 
residues for suitable moduli. 

1 Introduction 

It is striking to observe that two decades after the discovery 
of public-key cryptography, the cryptographer's toolbox still
contains very few asymmetric encryption schemes. Conse- 
quently, the search for new public-key mechanisms remains 
a major challenge. The quest appears sometimes hopeless 
as new schemes are immediately broken or, if they survive, 
are compared with RSA, which is obviously elegant, simple 
and efficient.

Similar investigations have been relatively successful in 
the related setting of identification, where a user attempts 
to convince another entity of his identity by means of an on- 
line communication. For example, there have been several 
attempts to build identification protocols based on simple 
operations (see [33, 35, 36, 26]). Although the question of 
devising new public-key cryptosystems appears much more 
difficult (since it deals with trapdoor functions rather than 
simple one-way functions), we feel that research in this di- 
rection is still in order: simple yet efficient constructions 
may have been overlooked. 

The scheme that we propose in the present paper uses 
an RSA integer n which is a product of two primes p and q,

as usual. However, it is quite different from RSA in many

respects: 

1. it encrypts messages by exponentiating them with re- 
spect to a fixed base rather than by raising them to a 
fixed power 

2. it uses a different "trapdoor" for decryption

3. its strength is not directly related to the strength of 
RSA 

4. it exhibits further "algebraic" properties that may prove 
useful in some applications. 

We briefly comment on those differences. The first one may 
offer a competitive advantage in environments where a large 
amount of memory is available: such environments allow 
impressive speed-ups in exponentiations that do not have 
analogous counterparts in RSA-like operations. The second 
is of obvious interest in view of the fact quoted above that 
there are very few public-key cryptosystems available. With- 
out going into technical details at this point, let us simply 
mention that the new trapdoor is obtained by injecting small 
prime factors in p - 1 and q - 1. In order to understand what
the third difference is, we note that, if the modulus n can 
be factored, then both RSA and the proposed cryptosystem 
are broken. However, it is an open problem whether or not 
RSA is "equivalent" to factoring, which would mean that
breaking RSA allows one to factor. For this reason, the hypothesis
that RSA is secure has become an assumption of its
own, formally stronger than factoring. Our cryptosystem is 
related to another hypothesis, also formally stronger than 
factoring and known as the higher residuosity assumption. 
This may help to understand how these various hypotheses 
are related. Finally, we will explain the algebraic property 
of our scheme (called the homomorphic property) by means 
of an example: suppose that one wishes to withdraw a small 
amount u from the balance m of some account; assume fur- 
ther that the balance is given in encrypted form E(m) and 
that the clerk performing the operation does not have access 
to decryption. The cryptosystem that we propose simply 
solves the problem by computing E(m)/E(u) mod n, which 
turns out to be the encryption of the new balance m - u.
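The withdrawal example can be checked numerically. The following is a minimal sketch, assuming the deterministic encryption E(m) = g^m mod n described in this paper; the modulus n = 571 * 2003 and the base g = 2 are toy values chosen only for illustration (they are far too small to be secure, and g is not claimed to have the order required for decryption). The class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical toy demonstration of the homomorphic property.
class HomomorphicSketch {
    // Toy parameters (assumptions): a product of two small primes and an arbitrary base.
    static final BigInteger N = BigInteger.valueOf(571L * 2003L);
    static final BigInteger G = BigInteger.valueOf(2);

    // E(m) = g^m mod n
    static BigInteger encrypt(long m) {
        return G.modPow(BigInteger.valueOf(m), N);
    }

    // E(m) / E(u) mod n, computed as E(m) * E(u)^(-1) mod n,
    // without ever decrypting either ciphertext.
    static BigInteger withdraw(BigInteger encryptedBalance, BigInteger encryptedAmount) {
        return encryptedBalance.multiply(encryptedAmount.modInverse(N)).mod(N);
    }
}
```

With these toy values, withdraw(encrypt(500), encrypt(120)) yields the same ciphertext as encrypt(380), which is the identity E(m)/E(u) = E(m - u) used by the clerk in the example.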

The ability to perform algebraic operations such as addi- 
tions or subtractions by playing only with the cryptograms 
has potential applications in several contexts. We quote a 
few: 

1. in election schemes, it provides a tool to obtain the 
tally without decrypting the individual votes (see [4]) 



2. in the area of watermarking, it allows one to add a mark
to previously encrypted data (as explained in [25]). 

Still, in these contexts, it is often necessary to encrypt data
taken from a small set S (e.g. 0/1 votes) and it is well known
that deterministic cryptosystems, such as RSA, fail here: in
order to decrypt E(a), one can simply compare the ciphertext
with the encryptions of all members of S and thus find
the correct value of a. In order to overcome the difficulty,
one has to use probabilistic encryption, where each plaintext 
has many corresponding ciphertexts, depending on some ad- 
ditional random parameter chosen at encryption time. Such 
a scheme should make it impossible to distinguish encryp- 
tions of distinct values, even if these are restricted to range 
over a set with only two elements. This very strong require- 
ment has been termed semantic security ([12]). As a further 
difference with RSA, the cryptosystem introduced in this 
paper has a very natural probabilistic version, with proven
semantic security. 
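The weakness of deterministic encryption over a small plaintext set can be made concrete. The sketch below again assumes E(m) = g^m mod n with toy values n = 571 * 2003 and g = 2, chosen only for illustration; an attacker who knows the public values simply encrypts every candidate and compares. All names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical illustration of the exhaustive-comparison attack on a small set S.
class SmallSetAttackSketch {
    // Toy public parameters (assumptions, not secure sizes).
    static final BigInteger N = BigInteger.valueOf(571L * 2003L);
    static final BigInteger G = BigInteger.valueOf(2);

    // Deterministic encryption: the same plaintext always gives the same ciphertext.
    static BigInteger encrypt(long m) {
        return G.modPow(BigInteger.valueOf(m), N);
    }

    // Recovers the plaintext using public information only;
    // returns -1 if no candidate matches.
    static long guess(BigInteger ciphertext, long[] candidates) {
        for (long candidate : candidates) {
            if (encrypt(candidate).equals(ciphertext)) {
                return candidate;
            }
        }
        return -1;
    }
}
```

For a 0/1 vote the attacker needs at most two encryptions, which is why a random parameter chosen at encryption time is required.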

The probabilistic homomorphic encryption schemes proposed
so far suffer from a serious drawback: they have very
poor bandwidth. Typically, they need something like one kilobit
to encrypt just a few bits, which is a quite severe expansion
rate. This may be acceptable for election schemes but
definitely hampers other applications. The main achievement
of the present paper is to reach a significant bandwidth
while keeping the other properties, including semantic
security.

Before we turn to the more technical developments of our
paper, it is in order to compare it with earlier work: it is indeed
the case that the question of finding trapdoors for the
discrete logarithm problem has been the subject of many papers.
At this point, it is fair to mention that the probabilistic
cryptosystem that we propose is actually quite close to the
most general case of the homomorphic encryption schemes
introduced by Benaloh in his Ph.D. thesis [4]. Still, both in
this thesis and in the related work ([5, 6, 7]), the security
and potential applications are only investigated in a setting
where the bandwidth remains small. A more recent paper
by Park and Won (see [24]) describes a related probabilistic
cryptosystem using a trapdoor based on injecting a single
power of a small odd integer into p - 1 or q - 1 and proves
its security with respect to an ad hoc statement. Thus, our
paper offers the first thorough discussion of the security of
a probabilistic homomorphic encryption scheme with significant
bandwidth. After the completion of the present work,
we have been informed that another homomorphic probabilistic
encryption scheme, using moduli n of the form p^2 q,
where p and q are primes, had been found by Okamoto and
Uchiyama (see [22]), achieving an expansion rate similar to
ours. Finally, it should be emphasized that the deterministic
version of our scheme is not simply a twist that fixes the
random string in the probabilistic version: considering its
practicality, we believe that, even if it is not intended to be
a direct competitor to RSA, it enters the very limited list of
efficient public-key cryptosystems.

The paper is organized as follows: in the next two sec- 
tions, we successively describe the deterministic and the 
probabilistic version of our scheme, the former with a prac- 
tical approach, the latter in a more complexity-theoretic 
spirit. We then discuss applications and end up with a chal- 
lenge for the research community. 

2 The deterministic version 

As was just mentioned, our approach to the deterministic
scheme is practically oriented: we discuss system set-up
and key generation, encryption and decryption, with
performance in mind. We also carry out an informal security
analysis and derive minimal suggested parameters.

2.1 System set-up and key generation 

The scheme that we propose in the present paper can be
described as follows: let $\sigma$ be a squarefree odd $B$-smooth
integer, where $B$ is a small integer, and let $n = pq$ be an RSA
modulus such that $\sigma$ divides $\phi(n)$ and is prime to $\phi(n)/\sigma$.
Typically, we think of $B$ as being a 10-bit integer and we
consider $n$ to be at least 768 bits long. Let $g$ be an element
whose multiplicative order modulo $n$ is a large multiple of
$\sigma$. Publish $n$, $g$ and keep $p$, $q$ and optionally $\sigma$ secret. A
message $m$ smaller than $\sigma$ is encrypted by $g^m \bmod n$;
decryption is performed using the prime factors of $\sigma$ as will be
seen in the next subsection.

Generation of the modulus appears rather straightforward:
pick a family $p_i$ of $k$ small odd distinct primes, with
$k$ even. Set

$u = \prod_{i=1}^{k/2} p_i$, $v = \prod_{i=k/2+1}^{k} p_i$ and $\sigma = uv = \prod_{i=1}^{k} p_i$.

Pick two large primes $a$ and $b$ such that both
$p = 2au + 1$ and $q = 2bv + 1$ are prime and let $n = pq$.

However, this generation is lengthy, especially when the
size of the modulus grows: $a$ has to be chosen in the
appropriate range and tested for primality, as well as $p = 2au + 1$,
until both tests succeed simultaneously. This might be a bit
time-consuming. Instead, we suggest generating $a$, $b$, $u$ and
$v$ first (independently of any primality requirements on $p$
and $q$) and using a couple of 24-bit "tuning primes" $p'$ and $q'$
(not used in the encryption process) such that $p = 2aup' + 1$
and $q = 2bvq' + 1$ are primes. To avoid interference with
the encryption mechanics, we recommend making sure that
$\gcd(p'q', \sigma) = 1$ and $p' \ne q'$. In practice, such an approach
is only 9% slower than equivalent-size RSA key-generation.
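The tuning-prime procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `is_prime` and `tune` are hypothetical, the primes $a = 101$ and $b = 191$ and the six small primes are borrowed from the toy example of section 2.3, and trial division replaces a real primality test, so the tuning primes found here are tiny rather than 24 bits long.

```python
from math import gcd, prod

def is_prime(x: int) -> bool:
    """Trial division; adequate only for toy-sized integers."""
    if x < 2:
        return False
    if x % 2 == 0:
        return x == 2
    d = 3
    while d * d <= x:
        if x % d == 0:
            return False
        d += 2
    return True

def tune(a: int, u: int, sigma: int, exclude: int = 0) -> tuple[int, int]:
    """Return (t, p) where t is a tuning prime coprime to sigma,
    t != exclude, and p = 2*a*u*t + 1 is prime."""
    t = 2
    while True:
        if t != exclude and is_prime(t) and gcd(t, sigma) == 1:
            p = 2 * a * u * t + 1
            if is_prime(p):
                return t, p
        t += 1

small_primes = [3, 5, 7, 11, 13, 17]          # the p_i, k = 6
u, v = prod(small_primes[:3]), prod(small_primes[3:])
sigma = u * v
a, b = 101, 191                                # toy stand-ins for the large primes

p_tune, p = tune(a, u, sigma)                  # p = 2*a*u*p' + 1
q_tune, q = tune(b, v, sigma, exclude=p_tune)  # q = 2*b*v*q' + 1, q' != p'
n = p * q
phi = (p - 1) * (q - 1)
print(p_tune, q_tune, n)
```

Since $\phi(n) = 4ab\,p'q'\sigma$, the two conditions $\gcd(p'q', \sigma) = 1$ and $\gcd(ab, \sigma) = 1$ are what keep $\sigma$ prime to $\phi(n)/\sigma$.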

To select $g$, one can choose it at random and check
whether or not it has order $\phi(n)/4$. The main point is to ensure
that $g$ is not a $p_i$-th power, for each $i \le k$, by testing that
$g^{\phi(n)/p_i} \ne 1 \bmod n$. The success probability is:

$\pi = \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right)$, whose logarithm is: $\ln(\pi) \approx -\sum_{i=1}^{k} \frac{1}{p_i}$

If the $p_i$s are the first $k$ primes, this in turn can be estimated
as $-\ln \ln k$ and results in the quite acceptable overall
probability of $\pi \approx 1/\ln k$. Another method consists in choosing,
for each index $i \le k$, a random $g_i$ until it is not a $p_i$-th
power. With overwhelming probability $g = \prod_{i=1}^{k} g_i^{\sigma/p_i}$ has
order $\ge \phi(n)/4$.
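As an illustration of the first selection method, the sketch below tests candidate bases against the condition $g^{\phi(n)/p_i} \ne 1 \bmod n$ and evaluates the product formula for the success probability. It assumes the toy modulus of section 2.3; the function names are hypothetical, and the extra `gcd` check is an addition of this sketch rather than a step stated in the text.

```python
from math import gcd, prod

P_I = [3, 5, 7, 11, 13, 17]            # toy small primes (section 2.3)
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)

def is_valid_base(g: int) -> bool:
    """True if g is a unit and not a p_i-th power for any i."""
    return gcd(g, n) == 1 and all(pow(g, phi // pi, n) != 1 for pi in P_I)

def find_base(start: int = 2) -> int:
    """Smallest valid base >= start (the text picks g at random instead)."""
    g = start
    while not is_valid_base(g):
        g += 1
    return g

def success_probability(primes) -> float:
    """Probability that a random unit passes all k tests."""
    return prod(1 - 1 / pi for pi in primes)

g = find_base()
print(g, success_probability(P_I))
```

For the six toy primes the product evaluates to $6144/17017$, so roughly one random candidate in three is acceptable.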

2.2 Encryption and Decryption 

Encryption consists in a single modular exponentiation: a
message $m$ smaller than $\sigma$ is encrypted by $g^m \bmod n$. Note
that it does not require knowledge of $\sigma$. A lower bound
(preferably a power of two) is enough, but it is unclear how
important keeping $\sigma$ secret is for the security of the scheme.
However, if one chooses to keep $\sigma$ secret, necessary
precautions (similar to those applied to Rabin's scheme [31]
or Shamir's RSA for paranoids [34]) should be enforced to
avoid being used as an oracle¹.

¹ For example, an attacker having access to a decryption box can
decrypt $g^m \bmod n$ for some $m > \sigma$ and get $m \bmod \sigma$. This discloses
(by subtraction) a multiple of $\sigma$ and $\sigma$ can then be found by a few re-

Also, there is actually no reason why the $p_i$s should be
prime. Everything goes through, mutatis mutandis, as soon
as the $p_i$s are mutually prime. Thus, for example, they can
be chosen as prime powers, which is a way to increase the
variability of the scheme.

Decryption is based on the Chinese remainder theorem.
Let $p_i$, $1 \le i \le k$, be the prime factors of $\sigma$. The algorithm
computes the value $m_i$ of $m$ modulo each $p_i$ and gets the
result by Chinese remaindering, following an idea which goes
back to the Pohlig-Hellman paper [27]. In order to find $m_i$,
given the ciphertext $c = g^m \bmod n$, the algorithm computes
$c_i = c^{\phi(n)/p_i} \bmod n$, which is exactly $g^{m_i \phi(n)/p_i} \bmod n$. This
follows from the following easy computations, where $y_i$ stands
for $\frac{m - m_i}{p_i}$:

$c_i = c^{\phi(n)/p_i} = g^{m\phi(n)/p_i} = g^{(m_i + y_i p_i)\phi(n)/p_i} = g^{m_i\phi(n)/p_i} \, g^{y_i\phi(n)} = g^{m_i\phi(n)/p_i} \bmod n$

By comparing this result with all possible powers $g^{j\phi(n)/p_i}$, it
finds out the correct value of $m_i$. In other words, one loops
for $j = 0$ to $p_i - 1$ until $c_i = g^{j\phi(n)/p_i} \bmod n$.

The cleartext $m$ can therefore be computed by the
following procedure:

for i = 1 to k
{
    let $c_i = c^{\phi(n)/p_i} \bmod n$
    for j = 0 to $p_i - 1$
        { if $c_i = g^{j\phi(n)/p_i} \bmod n$ let $m_i = j$ }
}
x = ChineseRemainder({$m_i$}, {$p_i$})
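The procedure above can be made runnable as the following sketch. It assumes the toy modulus of section 2.3 and picks its own base with a hypothetical helper `find_base` (section 2.3 itself uses $g = 131$); the names `crt`, `encrypt` and `decrypt` are illustrative, and `pow(M, -1, m)` requires Python 3.8 or later.

```python
from math import gcd

P_I = [3, 5, 7, 11, 13, 17]            # prime factors of sigma (toy values)
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)

def find_base() -> int:
    """Smallest unit g with g^(phi/p_i) != 1 mod n for every i."""
    g = 2
    while gcd(g, n) != 1 or any(pow(g, phi // pi, n) == 1 for pi in P_I):
        g += 1
    return g

def crt(residues, moduli) -> int:
    """Chinese remaindering for pairwise coprime moduli."""
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        # Choose t so that x + M*t is congruent to r modulo m.
        t = ((r - x) * pow(M, -1, m)) % m
        x += M * t
        M *= m
    return x

def encrypt(m: int, g: int) -> int:
    return pow(g, m, n)

def decrypt(c: int, g: int) -> int:
    residues = []
    for pi in P_I:
        ci = pow(c, phi // pi, n)          # c_i = c^(phi(n)/p_i) mod n
        base = pow(g, phi // pi, n)
        power, mi = 1, None
        for j in range(pi):                # loop j = 0 .. p_i - 1
            if power == ci:
                mi = j
                break
            power = power * base % n       # now g^((j+1)*phi(n)/p_i)
        if mi is None:
            raise ValueError("ciphertext is not a power of g")
        residues.append(mi)
    return crt(residues, P_I)

g = find_base()
print(decrypt(encrypt(202, g), g))
```

Applied to the residues listed in section 2.3, `crt([1, 2, 6, 4, 7, 15], P_I)` recombines them into the message 202.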

The basic operation used by this (non-optimized)
algorithm is a modular exponentiation of complexity $\log^3(n)$,
repeated less than:

$k p_k < \log(n) \, p_k \approx \log(n) \, k \log(k) < \log^2(n) \log\log(n)$

times. Decryption therefore takes $\log^5(n) \log\log(n)$ bit
operations.

This is clearly worse than the $\log^3(n)$ complexity of RSA
but decryption can be optimized if a table stores all possible
values of $t[i,j] = g^{j\phi(n)/p_i} \bmod n$, for $1 \le i \le k$ and $1 \le j \le p_i$:
the value $m_i$ of the cleartext $m$ modulo $p_i$ is found by table
look-up, once $c^{\phi(n)/p_i} \bmod n$ has been computed. It is not
really necessary to store all $g^{j\phi(n)/p_i}$. Any hash function that
distinguishes $g^{j\phi(n)/p_i}$ from $g^{j'\phi(n)/p_i}$, for $j \ne j'$, will do and, in
practical terms, a few bytes will be enough, for example
approximately $2|p_i|$ bits from each $t[i,j]$. It is even possible to
use hash functions that do not discriminate values of $g^{j\phi(n)/p_i}$;
the proper one is spotted by considering, by table look-up

peated trials and gcds. To prevent such an action, the decryption box
cannot merely re-encrypt and check against the ciphertext received, as
this allows a search by dichotomy. It should first check that the
cleartext is in the appropriate range, e.g. $< 2^t$ with $2^t < \sigma$, re-encrypt it
and then check that it matches up with the original ciphertext before
letting anything out.



hashes of $g^{\ell j\phi(n)/p_i}$, for $\ell = 1, 2, \cdots$ until there is no
ambiguity. This can be very efficiently implemented by storing
hash values in increasing order w.r.t. $\ell$ and one single bit
might be enough.
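A table-driven variant of decryption might look like the sketch below. It stores the full values $g^{j\phi(n)/p_i} \bmod n$ as dictionary keys; the truncated-hash refinement discussed above is deliberately not implemented. The toy parameters come from section 2.3, and all function names are hypothetical.

```python
from math import gcd

P_I = [3, 5, 7, 11, 13, 17]
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)

def find_base() -> int:
    """Smallest unit g with g^(phi/p_i) != 1 mod n for every i."""
    g = 2
    while gcd(g, n) != 1 or any(pow(g, phi // pi, n) == 1 for pi in P_I):
        g += 1
    return g

def build_table(g: int):
    """table[i] maps g^(j*phi(n)/p_i) mod n to j, for j = 0 .. p_i - 1."""
    return [{pow(g, j * (phi // pi), n): j for j in range(pi)} for pi in P_I]

def crt(residues, moduli) -> int:
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        x += M * (((r - x) * pow(M, -1, m)) % m)
        M *= m
    return x

def decrypt_with_table(c: int, table) -> int:
    # One exponentiation and one dictionary look-up per small prime.
    residues = [table[i][pow(c, phi // pi, n)] for i, pi in enumerate(P_I)]
    return crt(residues, P_I)

g = find_base()
table = build_table(g)
print(decrypt_with_table(pow(g, 202, n), table))
```

The table is computed once per key, so each decryption costs $k$ exponentiations plus $k$ look-ups instead of up to $\sum_i p_i$ comparisons.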

2.3 A toy example 

• key generation for k = 6

p = 21211 = 2 × 101 × 3 × 5 × 7 + 1,
q = 928643 = 2 × 191 × 11 × 13 × 17 + 1,
n = 21211 × 928643 = 19697446673 and g = 131 yield the table:

          i = 1   i = 2   i = 3   i = 4   i = 5   i = 6
j = 0     0001    0001    0001    0001    0001    0001
j = 1     1966    6544    1967    6273    6043    0372
j = 2     9560    3339    4968    7876    4792    7757
j = 3             9400    1765    8720    0262    3397
j = 4             5479    6701    7994    0136    0702
j = 5                     6488    8651    6291    4586
j = 6                     2782    4691    0677    8135
j = 7                             9489    1890    3902
j = 8                             8537    6878    5930
j = 9                             2312    2571    6399
j = 10                            7707    7180    6592
j = 11                                    8291    9771
j = 12                                    0678    0609
j = 13                                            7337
j = 14                                            6892
j = 15                                            3370
j = 16                                            3489

where entry $[i,j]$ contains $g^{j\phi(n)/p_i} \bmod n \bmod 10000$.
• encryption of m = 202

$c = g^m \bmod n = 131^{202} \bmod 19697446673 = 519690214$

• decryption
by exponentiation, we retrieve:

$c^{\phi(n)/p_1} \bmod n \bmod 10000 = 1966$
$c^{\phi(n)/p_2} \bmod n \bmod 10000 = 3339$
$c^{\phi(n)/p_3} \bmod n \bmod 10000 = 2782$
$c^{\phi(n)/p_4} \bmod n \bmod 10000 = 7994$
$c^{\phi(n)/p_5} \bmod n \bmod 10000 = 1890$
$c^{\phi(n)/p_6} \bmod n \bmod 10000 = 3370$

wherefrom, by table lookup:

m mod 3 = table(1966) = 1
m mod 5 = table(3339) = 2
m mod 7 = table(2782) = 6
m mod 11 = table(7994) = 4
m mod 13 = table(1890) = 7
m mod 17 = table(3370) = 15

and by Chinese remaindering: m = 202.

2.4 Suggested parameters and security analysis

We suggest taking $\sigma > 2^{160}$ and we consider $|n| = 768$ bits
as a minimum size for the modulus.

If the factorization of $n$ is found, then $a$ and $b$ become
known as well as $\phi(n)$. The scheme is therefore broken.
However, the scheme does not appear to be provably
equivalent to factoring. Rather, it is related to the question of
having oracles that decide whether or not a random number
$x$ is a $p_i$-th power modulo $n$, for $i = 1, \ldots, k$. This is
known as the higher residuosity problem and is currently
considered infeasible. Formal equivalence of this problem
and the probabilistic version of our encryption scheme will
be proved in the next section. Considering the basic
deterministic version, we have no formal proof but we have not
found any plausible line of attack either. Also, the efficient
factoring methods such as the quadratic sieve (QS) or the
number field sieve (NFS) do not appear to take any advantage
from the side information that $u$ (resp. $v$) divides $p - 1$
(resp. $q - 1$). The same is true of simpler methods like
Pollard's $p - 1$ since we have ensured that neither $p - 1$ nor
$q - 1$ is smooth. Finally, elliptic curve weaponry [18] will
not pull out factors of $n$ in the range considered. Note that
the requested size of $n$ (768 bits or more) makes factoring $n$
a very hard task anyway.

We now turn to the size of $\sigma$. In order to avoid the
computation of discrete logarithms by the baby step-giant step
method, we have to make $\sigma$ large enough. As already stated,
$2^{160}$ is a minimum. This can be achieved for example by
making $\sigma$ a permutation of the first 30 odd primes, which
yields $\sigma \approx 2^{160.45}$. Alternatively, one can choose a
sequence of 16 primes with 10 bits. Since there are 75 such
primes, this leads to a ≈ 58-bit entropy. Adding prime
powers, as stated above, will further increase these figures.
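Two of the figures quoted above are easy to check numerically. The short sketch below (with a hypothetical helper `primes_up_to`) computes the bit size of the product of the first 30 odd primes and counts the 10-bit primes; it does not attempt to reproduce the entropy estimate.

```python
from math import log2

def primes_up_to(limit: int) -> list[int]:
    """Sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for multiple in range(i * i, limit + 1, i):
                sieve[multiple] = False
    return [i for i, flag in enumerate(sieve) if flag]

# First 30 odd primes: 3, 5, ..., 127.
odd_primes = [x for x in primes_up_to(200) if x > 2][:30]
sigma_bits = sum(log2(x) for x in odd_primes)   # log2 of their product

# Primes that need exactly 10 bits, i.e. 512 <= p <= 1023.
ten_bit_primes = [x for x in primes_up_to(1023) if x >= 512]

print(round(sigma_bits, 2), len(ten_bit_primes))
```

The first value confirms that the 30-prime choice only just clears the $2^{160}$ minimum, which is why the text mentions prime powers as a way to gain margin.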

There is a further difficulty, when $\sigma$ is known. Note that

$4ab = \frac{\phi(n)}{\sigma} = \frac{n - p - q + 1}{\sigma}$

hence $4ab$ differs from $\frac{n}{\sigma}$ only by $\epsilon = \frac{p + q - 1}{\sigma}$. The
numerator is of size $|n|/2$, hence, if it does not exceed the
denominator by a fairly large number of bits, the value of
$ab$ is basically known and decryption can be performed.

When the exact splitting of the factors of $\sigma$ into $u$ and
$v$ is known as well, the previous analysis can be pushed
further. Reducing the relation $n = (2au + 1)(2bv + 1)$ modulo
$u$, we find that $n = 2bv + 1 \bmod u$ and we can calculate
$d = b \bmod u$. Similarly, we learn $c = a \bmod v$. We let
$a = rv + c$ and $b = su + d$, with $r$, $s$ unknown and, using
the fact that $\sigma = uv$, we obtain:

$n = (2rvu + 2cu + 1)(2suv + 2dv + 1) = 4rs\sigma^2 + 2\sigma[r(2dv + 1) + s(2cu + 1)] + (2cu + 1)(2dv + 1)$

which is of the form

$n = 4rs\sigma^2 + 2\sigma(\alpha r + \beta s) + \gamma$

with known $\alpha$, $\beta$ and $\gamma$. Reducing modulo $\sigma^2$, this provides
the value $\delta$ of $\alpha r + \beta s \bmod \sigma$. At this point, our analysis
becomes quite technical and the reader may skip the following
and jump to the conclusion that $n \gg \sigma^4$.
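The algebra above can be checked numerically. The sketch below uses the toy values of section 2.3 without tuning primes; the variable names mirror the symbols of the text, and the way $d$ and $c$ are extracted from $n$ by modular inversion is one possible reading of "we can calculate", stated here as an assumption.

```python
# Toy parameters from section 2.3 (no tuning primes).
a, b = 101, 191
u, v = 3 * 5 * 7, 11 * 13 * 17
sigma = u * v
n = (2 * a * u + 1) * (2 * b * v + 1)

# What an attacker who knows n, u and v can compute:
# n = 2bv + 1 (mod u) gives d = b mod u, and symmetrically c = a mod v.
d = ((n - 1) * pow(2 * v, -1, u)) % u
c = ((n - 1) * pow(2 * u, -1, v)) % v
alpha, beta = 2 * d * v + 1, 2 * c * u + 1
gamma = alpha * beta
delta = ((n - gamma) // (2 * sigma)) % sigma    # alpha*r + beta*s mod sigma

# Secret quotients, used here only to verify the identity.
r, s = (a - c) // v, (b - d) // u

print(d, c, delta)
```

Only `r` and `s` depend on secret data; everything else follows from $n$, $u$ and $v$, which is why the text goes on to bound the number of candidate pairs $(r, s)$.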

For the interested reader, we note that the pair $(r, s)$ lies
in the two-dimensional lattice $L$ defined by

$L = \{(x, y) \mid \alpha x + \beta y = \delta \bmod \sigma\}$

This lattice has determinant $\sigma$. Also, it is easily seen that
$\alpha$ and $\beta$ are bounded by $2\sigma$ and $\gamma$ by $4\sigma^2$. From this we get

$rs < \frac{n}{4\sigma^2} < rs + r + s + 1 = (r + 1)(s + 1)$

Thus, the pair $(r, s)$ is very close to the boundary of the
curve $C$ with equation $xy = \frac{n}{4\sigma^2}$. More precisely, the
distance between the pair $(r, s)$ and the curve does not exceed
$\sqrt{2}$. This defines a geometric area $A$ that includes $(r, s)$.
Now, key generation usually induces constraints that limit
the possible range of the parameters. For this reason, it is
appropriate to replace $C$ by the line $x + y = \frac{\sqrt{n}}{\sigma}$ in order
to estimate the size of $A$. This leads to an approximation
which is $O(\frac{\sqrt{n}}{\sigma})$. The number of lattice points from $L$ in
this area is, in turn, measured by the ratio between the size
of $A$ and the determinant, which is $\frac{\sqrt{n}}{\sigma^2}$. It is safe to ensure
that this set is beyond exhaustive search, which we express
by $n \gg \sigma^4$.

Note that the ratio $|n|/|\sigma|$ is the expansion rate of the
encryption, where $|n|$ denotes, as usual, the size of $n$ in bits.
It is of course desirable to make this rate as low as possible.
On the other hand, as a consequence of the above remarks,
we see that $|n| - 4|\sigma|$ should be large. Asymptotically, this
is achieved as soon as we fix an expansion rate which is
$> 4$. For real-size parameters, we suggest respecting the
heuristic bound $|n| - 4|\sigma| \ge 128$, which is consistent with
our minimal parameters ($768 - 4 \times 160 = 128$). Larger
parameters allow a slightly better expansion rate.

2.5 Performances 

Despite its expansion rate, the new cryptosystem is quite ef- 
ficient: encryption requires the elevation of a constant 768- 
bit number to a 160-bit power. Several batch ([21, 23]) and 
pre-processing ([2]) techniques can speed-up such computa- 
tions, which might be a small advantage over RSA. 

Decryption is slightly more awkward since $k$
exponentiations are needed. But this number can be reduced in a few
ways:

Firstly, while computing $c^{\phi(n)/p_i} \bmod n$ for each $i$, it
is possible to first store $c' = c^{4ab} \bmod n$ and raise $c'$ to
the successive powers $\sigma/p_i$ so that (besides the first one),
the remaining exponentiations involve 160-bit powers. One
can further, in the square-and-multiply algorithm, share
the "square" part of the various exponentiations. A careful
bookkeeping of the number of modular multiplications
obtained by setting $|n| = 768$ and choosing sixteen 10-bit
primes $p_i$ shows that the total number of modular
multiplications decreases to 2352: 912 for the computation of $c'$
and 1440 for the rest. Actually, the "multiply" part can be
somehow amortized as well: we refer to [21] for a proper
description of such an optimized exponentiation strategy. The
resulting computing load is less than what is needed for a
couple of RSA decryptions with a similar modulus.
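The first optimization can be illustrated on the toy modulus of section 2.3, where $\phi(n)/\sigma = 4ab$. The sketch also includes one accounting of the multiplication counts that happens to reproduce the quoted figures (an average of one multiplication per two exponent bits, with the 160 squarings shared); this accounting is an assumption of the sketch, not a statement of how the authors counted.

```python
P_I = [3, 5, 7, 11, 13, 17]
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)
sigma = 3 * 5 * 7 * 11 * 13 * 17
four_ab = phi // sigma                 # equals 4*a*b = 4*101*191 here

def powers_direct(c: int) -> list[int]:
    """k full-size exponentiations c^(phi(n)/p_i) mod n."""
    return [pow(c, phi // pi, n) for pi in P_I]

def powers_shared(c: int) -> list[int]:
    """One long exponentiation, then k short ones with exponents sigma/p_i."""
    c_prime = pow(c, four_ab, n)
    return [pow(c_prime, sigma // pi, n) for pi in P_I]

# Hypothetical bookkeeping for |n| = 768, |sigma| = 160, k = 16.
n_bits, sigma_bits, k = 768, 160, 16
long_exp_bits = n_bits - sigma_bits                    # size of 4ab
cost_c_prime = long_exp_bits + long_exp_bits // 2      # squarings + multiplies
cost_rest = sigma_bits + k * (sigma_bits // 2)         # shared squarings + multiplies

print(cost_c_prime, cost_rest, cost_c_prime + cost_rest)
```

Both functions return the same list because $4ab \cdot \sigma/p_i = \phi(n)/p_i$ exactly.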

Unfortunately, there is a drawback in reducing the value
of $k$: in the 30-prime variant it is necessary to store 1718
different $t[i,j]$ hash values. Hashing on two bytes seems
enough and results in an overall memory requirement of four
kilobytes. In the 16-prime variant, hash values of 3 bytes
are necessary and the table size is on the order of 100 kilobytes.
As observed at the end of section 2.2, the hash table can
be drastically reduced at the cost of a minute computation
overhead.
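The entry count for the 30-prime variant is simply the sum of the primes involved, since each $p_i$ contributes $p_i$ table entries. A quick check, using a hypothetical helper `first_odd_primes`:

```python
def first_odd_primes(count: int) -> list[int]:
    """First `count` odd primes, by trial division against earlier primes."""
    found: list[int] = []
    candidate = 3
    while len(found) < count:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 2
    return found

primes_30 = first_odd_primes(30)
entries = sum(primes_30)           # one table entry per (i, j) pair
table_bytes = 2 * entries          # two-byte hashes

print(entries, table_bytes)
```

With two-byte hashes the table fits in the "four kilobytes" quoted above, and in the 4,096 bytes of EEPROM mentioned in section 2.6.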

Another speed-up can be obtained by separately
performing decryption modulo $p$ and $q$ so as to take advantage
of smaller operand sizes. This alone divides the decryption
workfactor by four.

Finally, decryption is inherently parallel and naturally 
adapted to array processors since each mi can be computed 
independently of all the others. 

2.6 Implementation 

The new scheme (768-bit n, k = 30) was actually imple- 
mented on a 68HC05-based ST16CF54 smart-card (4,096 
EEPROM bytes, 16,384 ROM bytes and 352 RAM bytes). 
The public key is only 96-byte long and as in most smart- 
card implementations, n's storage is avoided by a command 
that re-computes the modulus from its factors upon request 
(re-computation and transmission take 10 ms). For further 
space optimization g's first 91 bytes are the byte-reversed 
binary complement of n's last 91 bytes. Decryption (a 4,119- 
byte routine) takes 3,912 ms. Benchmarks were done with 
a 5 MHz oscillator and ISO 7816-3 T=0 transmission at 
115,200 bauds. 

3 The probabilistic version 

3.1 The setting 

We now turn to the probabilistic version of the scheme. As
already explained, we adopt a more complexity-oriented
approach and, for example, we view $B$ as bounded by a
polynomial in $\log n$. The probabilistic version replaces the
ciphertext $g^m \bmod n$ by $c = x^\sigma g^m \bmod n$, where $x$ is chosen at
random among positive integers $< n$. Decryption remains
identical. This is due to the fact that the effect of
multiplying by $x^\sigma$ is cancelled by raising the ciphertext to the
various powers $\phi(n)/p_i$ as performed by the decryption
algorithm. Note that this version requires $\sigma$ to be public.

The resulting scheme is homomorphic, which means that
$E(m + m' \bmod \sigma) = E(m)E(m') \bmod n$. Probabilistic
homomorphic encryption has found many applications,
both practically and theoretically oriented. To name a few,
we quote the early work of Benaloh on election schemes ([4])
and the area of zero-knowledge proofs for NP (see [13, 3]).
Known such schemes are the Quadratic Residuosity scheme
of Goldwasser and Micali ([12]), which encrypts only one bit,
and its extensions to higher residues modulo a single prime
(see [4]), which encrypt a few bits. As already explained
in section 1, these schemes suffer from a serious drawback:
a complexity-theoretic analysis has to view the cleartext as
logarithmic in the size of the ciphertext. In other words,
the expansion rate, i.e. the ratio between the length of
the ciphertext and the length of the cleartext, is huge. In
our proposal, this ratio is exactly $|n|/|\sigma|$. Note that our
assumption that $\sigma$ is $B$-smooth, for some small $B$, does
not preclude a linear ratio. The maximum size of $\sigma$ is
$\sum_{p \le B} \log p$, where $p$ ranges over primes, and it is known
that $\theta(B) = \sum_{p \le B} \ln p \sim B$. Thus, even if $B$ is logarithmic
in $n$, there are enough primes to make $|\sigma|$ a linear
proportion of $|n|$. This is a definite improvement over previous
homomorphic schemes. Note however that, following the
comments in section 2.4, it is safe to take $|\sigma|/|n| < 1/4$.
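The probabilistic encryption $c = x^\sigma g^m \bmod n$ and its homomorphic property can be exercised on the toy modulus of section 2.3. In this sketch the blinding values are fixed rather than random so that the run is reproducible, the base is chosen by a hypothetical helper `find_base`, and all function names are illustrative.

```python
from math import gcd

P_I = [3, 5, 7, 11, 13, 17]
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)
sigma = 3 * 5 * 7 * 11 * 13 * 17

def find_base() -> int:
    g = 2
    while gcd(g, n) != 1 or any(pow(g, phi // pi, n) == 1 for pi in P_I):
        g += 1
    return g

G = find_base()

def encrypt(m: int, x: int) -> int:
    """Probabilistic encryption c = x^sigma * g^m mod n (x must be a unit)."""
    if gcd(x, n) != 1:
        raise ValueError("x must be coprime to n")
    return pow(x, sigma, n) * pow(G, m, n) % n

def crt(residues, moduli) -> int:
    acc, M = 0, 1
    for r, m in zip(residues, moduli):
        acc += M * (((r - acc) * pow(M, -1, m)) % m)
        M *= m
    return acc

def decrypt(c: int) -> int:
    residues = []
    for pi in P_I:
        ci, base = pow(c, phi // pi, n), pow(G, phi // pi, n)
        # The x^sigma factor vanishes here because sigma/p_i is an integer.
        residues.append(next(j for j in range(pi) if pow(base, j, n) == ci))
    return crt(residues, P_I)

c1, c2 = encrypt(100, 12345), encrypt(202, 67891)
print(decrypt(c1 * c2 % n))        # decrypts the product of two ciphertexts
```

Multiplying ciphertexts adds cleartexts modulo $\sigma$, so sums that exceed $\sigma$ wrap around.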

3.2 A complexity theoretic approach 

We already observed that the security of our proposal is
related to the question of distinguishing higher residues
modulo $n$, that is integers of the form $x^p \bmod n$, when $p$ is a
prime divisor of $\phi(n)$. In the rest of this section, we want
to clarify this relationship in the asymptotic setting of
complexity theory. In view of the remarks just made, we find
it convenient to assume that the ratio $|\sigma|/|n|$ has a fixed value
$\alpha < 1/4$. We also fix a polynomial $B$ in $\log n$. The
parameters which are of interest to us are pairs $(n, \sigma)$ such that $\sigma$ is
squarefree, odd and $B$-smooth, $n$ is a product of two primes
$p$, $q$, $\sigma$ is a divisor of $\phi(n)$ prime to $\phi(n)/\sigma$ and $|\sigma|/|n| = \alpha$. We
call any integer $n$ that appears as first coordinate of such
a pair $(B, \alpha)$-dense. Distinguishing higher residues is
usually considered difficult (see [4]). We conjecture that this
remains true when $n$ varies over $(B, \alpha)$-dense integers.
Towards a more precise statement, let $R_p(y, n)$ be one if $y$ is a
$p$-th residue modulo $n$ and zero otherwise. Define a higher
residue oracle to be a probabilistic polynomial time
algorithm $A$ which takes as input a triple $(n, y, p)$ and returns a
bit $A(n, y, p)$ such that the following holds:
There exists a polynomial $Q$ in $|n|$ such that, for infinitely
many values of $|n|$, one can find a prime $p(|n|) < B$, with:

$\Pr\{A(n, y, p) = R_p(y, n)\} \ge 1 - \frac{1}{p} + \frac{1}{Q}$

where the probability is taken over the random tosses of $A$
and its inputs, conditionally on the event that $n$ is
$(B, \alpha)$-dense and $p$ is a divisor of $\phi(n)$.

Our Intractability Hypothesis is that there is no higher
residue oracle. The constant $1 - \frac{1}{p}$ comes from the obvious
strategy for approximating $R_p$ which consists in constantly
outputting zero. This strategy is successful for a proportion
$1 - \frac{1}{p}$ of the inputs.
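The baseline $1 - \frac{1}{p}$ can be observed by brute force on a very small modulus. The sketch below assumes $n = 7 \times 11$ and $p = 3$, which divides $\phi(n) = 60$ exactly once; these values are chosen purely for illustration and are far too small to be of any cryptographic relevance.

```python
from math import gcd

n, p = 7 * 11, 3                       # 3 divides phi(77) = 60 exactly once
units = [y for y in range(1, n) if gcd(y, n) == 1]
residues = {pow(x, p, n) for x in units}     # all p-th residues among units

def is_residue(y: int) -> int:
    """R_p(y, n): one if y is a p-th residue modulo n, zero otherwise."""
    return 1 if y in residues else 0

def constant_zero(y: int) -> int:
    """The trivial strategy: always answer 'not a residue'."""
    return 0

success = sum(constant_zero(y) == is_residue(y) for y in units) / len(units)
print(len(units), len(residues), success)
```

Exactly one unit in $p$ is a $p$-th residue, so any algorithm qualifying as an oracle has to beat this rate by a non-negligible margin $1/Q$.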

3.3 A security proof 

The security of probabilistic encryption schemes has been
investigated in [12]. In this paper, the authors introduced
the notion of semantic security. Given two messages $m_0$ and
$m_1$, a message distinguisher is a probabilistic polynomial
time algorithm $D$, which distinguishes encryptions of $m_0$
from encryptions of $m_1$. More accurately, it outputs a bit
$D(n, \sigma, g, y)$ in such a way that, setting

$\theta_i = \Pr\{D(n, \sigma, g, y) = 1 \mid y \in E(m_i)\}$

where $E(m_i)$ is the set of encryptions of $m_i$, the following
holds:

There exists a polynomial $Q$ in $|n|$ such that, for infinitely
many values of $|n|$, $|\theta_0 - \theta_1| \ge \frac{1}{Q}$.

Semantic security is the assertion that there is no pair of
polynomial time algorithms $F$, $D$ such that $F$ produces two
messages for which $D$ is a message distinguisher.

Theorem 1 Assume that no higher residue oracle exists.
Then, the probabilistic version of the encryption scheme has
semantic security.

The proof of this result uses the hybrid technique for which
we refer to [11]. It is technical in character and we have
chosen to only include a sketch of it in an appendix to the
present paper.

4 Applications and variants 

Even if we do not expect large-scale replacement of RSA by
our scheme, we feel that the latter is worth some academic
interest. In particular, we believe that it opens up new
applications. We have not yet fully investigated those potential
applications but we give some suggestions below.

4.1 Traceability

Our proposal could offer some help in the management of
key escrowing services. Consider the variant of the
Diffie-Hellman key exchange protocol, where a composite modulus
$n$ is used. Such a variant has been studied by various
researchers including McCurley in [20], where it is shown that
some specific choices lead to a scheme that is at least as
difficult as factoring. Assume further that the modulus $n$ and
the base for exponentiations $g$ are chosen as described in
section 1. It has been proposed (see e.g. [14]) that $g$ and $n$ could
be defined by some kind of TTP (Trusted Third Party).
Now, the user's public key $y$ and his secret key $x$ are related
by $y = g^x \bmod n$. It is conceivable to leave the choice of $x$
to the user with the provision that $x \bmod \sigma = ID$, where
$ID$ is the identity of the user. This can be checked by the
TTP upon registration of the key. Thus, we have reached a
situation where the identity is embedded in the public key
through a trapdoor, although the actual key is not. One
should not however overestimate the resulting functionality.
It could be useful in scenarios where traceability is made
possible via escrowing but where confidentiality cannot be
broken even with the help of the escrowing services.
Alternatively, it might be used to split traceability and secret
key recovery between key escrows. Note that the above
proposal requests that $\sigma$ is made public: as already observed,
this does not seem to endanger the scheme.
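The identity-embedding idea can be sketched as follows on the toy modulus of section 2.3. The functions `make_key_pair` and `recover_identity` are hypothetical names; the user picks any secret $x$ with $x \bmod \sigma = ID$, and the TTP, which knows the factorization, recovers $ID$ from the public key alone by running the decryption algorithm on it.

```python
from math import gcd
import random

P_I = [3, 5, 7, 11, 13, 17]
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)
sigma = 3 * 5 * 7 * 11 * 13 * 17

def find_base() -> int:
    g = 2
    while gcd(g, n) != 1 or any(pow(g, phi // pi, n) == 1 for pi in P_I):
        g += 1
    return g

G = find_base()

def make_key_pair(identity: int, rng: random.Random) -> tuple[int, int]:
    """User side: secret x with x mod sigma == identity, public y = g^x mod n."""
    if not 0 <= identity < sigma:
        raise ValueError("identity must be smaller than sigma")
    x = identity + sigma * rng.randrange(1, 1 << 16)
    return x, pow(G, x, n)

def recover_identity(y: int) -> int:
    """TTP side: decrypt y as if it were a ciphertext, yielding x mod sigma."""
    acc, M = 0, 1
    for pi in P_I:
        yi, base = pow(y, phi // pi, n), pow(G, phi // pi, n)
        r = next(j for j in range(pi) if pow(base, j, n) == yi)
        acc += M * (((r - acc) * pow(M, -1, pi)) % pi)
        M *= pi
    return acc

x, y = make_key_pair(4242, random.Random(0))
print(recover_identity(y))
```

The TTP learns only $x \bmod \sigma$; the quotient of $x$ by $\sigma$ stays hidden, which is the sense in which the identity is embedded "although the actual key is not".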

4.2 Variants of the scheme 

As is often the case, one can design numerous variants of 
the basic scheme. We will mention two because of their 
potential applications. 

Use of moduli with three prime factors. As for RSA, it is
possible to embed three prime factors $p$, $q$, $r$ in the
modulus in place of two. The construction is straightforward: the
small odd primes $p_i$ are split into three groups thus yielding,
by multiplication, three integers $u$, $v$, $w$. The three primes
are then sought among integers of the form $2au + 1$ (resp.
$2bv + 1$, resp. $2cw + 1$). It seems possible to keep the
minimum size of $n$ to 768 bits, which allows $a$, $b$, $c$ to be around
200 bits. Following an idea of Maurer and Yacobi ([19]), we
can then have a complete trapdoor for the discrete logarithm
with base $g$: once the $\sigma$ part has been computed, there
remains to compute the logarithm modulo $a$, $b$ and $c$, which
is not immediate but well within the reach of current
technology, since these numbers are 200-bit integers. Again, the
variant could prove useful in key escrowing scenarios of, say,
Diffie-Hellman keys, where it might be desirable to have a
lengthy recovery of the secret key for consumer's protection.

Multiplicative encryption. In this variant, $\sigma$ is made
public and encryption applies to messages of length $k$,
$m = \sum_{i=1}^{k} m_i 2^{i-1}$. In order to encrypt $m$, one computes
$e = \prod_{i=1}^{k} p_i^{m_i}$ and applies probabilistic encryption to $e$. Of course,
the bandwidth of this variant is very low: using a 768-bit
modulus $n$ and choosing the first 30 odd primes for the $p_i$s, we obtain
a 30-bit input and a 768-bit output. Allowing a larger input
has drastic consequences in terms of the size of $n$. The value
of $\sigma$ is close to $2^{560}$ when the first $k$ primes are used with
$k = 80$ but reaches $2^{998.4}$ for $k = 128$ and $2^{1309}$ for $k = 160$.
Using the heuristic bound mentioned in section 2.4, we get
for the length of $n$ something beyond 5000 bits if $k$ is 160.
This goes down to 2400 bits when $k = 80$.



As a result, the variant just described is not really
practical and there is little chance that it can ever be adopted
as an actual encryption scheme. On the other hand, the
ciphertext $c(m)$ can be used in an encryption scheme à la El
Gamal. The modulus is not prime since it is an RSA
modulus, but it makes no difference on the user's side. From
$h = c(m)$, he can manufacture a public key $y$ with a
corresponding matching secret key $x$ of his choice, $y = h^x \bmod n$.
The resulting cryptosystem allows ciphertext traceability in
the sense of Desmedt (see [9]). Our proposal makes it possible to
trace ciphertexts by a technique similar to the one used by
Desmedt, but decreases the size of the modulus from
something like 10000 bits to 2500 bits. The tracing algorithm
goes as follows: extract from an El Gamal encryption the
part $u = h^r \bmod n$ and apply the decryption algorithm,
treating $u$ as a ciphertext. The decryption algorithm will
basically find the original message $m$, which provides the
identity of the user and from which $h$ was built. Several
errors may occur due to the fact that $r$ might have some
of the $p_i$s as divisors: the corresponding decrypted values
of $m_i$ will be set to 1, regardless of their original values.
The correct value can be found if a sample of ciphertexts
is available or, alternatively, if an error-correction capacity
has been added to $m$. Such an error-correction mechanism
is highly advisable anyway in view of the attacks against
software key escrow reported in [15].
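The tracing algorithm can be sketched on the toy modulus of section 2.3, where $k = 6$ so that identities are 6-bit strings. The names `embed` and `trace`, the fixed blinding value and the fixed El Gamal exponents are assumptions made for reproducibility; in a real setting $x$ and $r$ would be random.

```python
from math import gcd, prod

P_I = [3, 5, 7, 11, 13, 17]
p, q = 21211, 928643
n, phi = p * q, (p - 1) * (q - 1)
sigma = prod(P_I)

def find_base() -> int:
    g = 2
    while gcd(g, n) != 1 or any(pow(g, phi // pi, n) == 1 for pi in P_I):
        g += 1
    return g

G = find_base()

def embed(bits: list[int], x: int) -> int:
    """h = c(m): probabilistic encryption of e = prod p_i^(m_i)."""
    e = prod(pi ** bit for pi, bit in zip(P_I, bits))
    return pow(x, sigma, n) * pow(G, e, n) % n

def trace(u: int) -> list[int]:
    """Decrypt u = h^r mod n; bit i reads 1 when the residue mod p_i is 0."""
    bits = []
    for pi in P_I:
        ui, base = pow(u, phi // pi, n), pow(G, phi // pi, n)
        residue = next(j for j in range(pi) if pow(base, j, n) == ui)
        bits.append(1 if residue == 0 else 0)
    return bits

identity = [1, 0, 1, 1, 0, 0]
h = embed(identity, x=12345)
print(trace(pow(h, 4096, n)))          # r = 4096 is coprime to sigma
print(trace(pow(h, 5 * 4096, n)))      # r divisible by p_2 = 5
```

The second call illustrates the error mode described above: when $r$ is a multiple of some $p_i$, the corresponding bit reads as 1 whatever its original value.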

Note that one can further reduce the size of the
exponent. This is because 40 bits may be considered enough
for tracing purposes. The value of $\sigma$ goes down to
approximately $2^{233}$ and 1088 bits becomes an acceptable minimum
length for the modulus.

5 Challenge 

It is a tradition in the cryptographic community to offer 
cash rewards for successful cryptanalysis. More than a sim- 
ple motivation means, such rewards also express the design- 
ers' confidence in their own schemes. As an incentive to 
the analysis of the new scheme, we therefore offer $ |n| to 
whoever will decrypt : 

c = 13370fe62d81fde356d1842fd7e5fc1ae5b9b449
    bdd00866597e61af4fb0d939283b04d3bb73f91f
    0d9d61eb0014690e567ab89aa8df4a9164cd4c6e
    6df80806c7cdceda5cfda97bf7c42cc702512a49
    dd196c8746c0e2ef36ca2aee21d4a36a_{16}

g = 0b9cf6a789959ed4f36b701a5065154f7f4f1517
    6d731b4897875d26a9e24415e111479050894ba7
    c532ada1903c63a84ef7edc29c208a8ddd3fb5f7
    d43727b730f20d8e12c17cd5cf9ab4358147cb62
    a9fb8878bf15204e444ba6ade6132743_{16}

n = 1459b9617b8a9df6bd54341307f1256dafa241bd
    65b96ed14078e80dc6116001b83c5f88c7bbcb0b
    db237daac2e76df5b415d089baa0£d078516e60e
    2cdda7c26b858777604c5fbd19f0711bc75ce00a
    5c37e2790b0d9d0ff9625c5ab9c7511d_{16}

where $k = 30$ ($p_i$ is the $i$-th odd prime) and the message is
ASCII-encoded. The challenger should be the first to decrypt
at least 50% of $c$ and publish the cryptanalysis method but
the authors are ready to carefully evaluate ad valorem any
feedback they get.

Appendix: Sketch of the Security Proof.

We show that any message distinguisher can be turned into
an algorithm that recognizes higher residues. We let $D$ be a
distinguisher for two messages $m_0$ and $m_1$ and start from the
fact that, keeping the above notations, $\theta_0$ and $\theta_1$ are
significantly distinct. We next use the hybrid technique for which
we refer to [11], pp. 91-93. Hybrids consist of a sequence of
random variables $Y_i$, $0 \le i \le k$, such that

1. Extreme hybrids coincide with $E(m_0)$ and $E(m_1)$
respectively.

2. Random values of each hybrid can be produced by a
probabilistic polynomial time algorithm.

3. There are only polynomially many hybrids.

In such a situation, [11] shows that $D$ distinguishes two
neighbouring hybrids. Our hybrids are formed by
considering a message $\mu_i$ such that

$\mu_i = m_0 \bmod p_j$ for $j > i$ and

$\mu_i = m_1 \bmod p_j$ for $j \le i$

and letting $Y_i$ be uniformly distributed over the set $E(\mu_i)$
of encryptions of $\mu_i$. It is easily seen that conditions 1, 2
and 3 are satisfied. Thus, for some index $i$, $D$ significantly
distinguishes $Y_i$ and $Y_{i-1}$. Set $\mu = \mu_i$, $p = p_i$ and let $\mu^j$,
$1 \le j \le p$, be the unique message such that

$\mu^j = \mu \bmod p_\ell$ for $\ell \ne i$ and $\mu^j = j \bmod p$

We note that both $\mu_i$ and $\mu_{i-1}$ appear among the $\mu^j$s and
we show that $D$ cannot distinguish encryptions of any two
of the $\mu^j$s. This will yield the desired contradiction.
Let

$\pi_j = \Pr\{D(n, \sigma, g, y) = 1 \mid y \in E(\mu^j)\}$

and assume that some $\pi_i$ significantly exceeds the other
ones. In other words, $\pi_i \ge \sup_{j \ne i} \pi_j + \frac{1}{Q}$ for some
polynomial $Q$ and infinitely many values of $|n|$. We show how
to predict $p$-th residuosity: given $z$, we run $D$ over a large
sample $N$ of inputs $(n, \sigma, g, y)$ where $y = x^\sigma z^{\ell\sigma/p} g^{\mu^i} \bmod n$, with
$x < n$ and $\ell < p$ chosen at random, and we average the
outputs. Now, if $z$ is a $p$-th residue, then $y$ simply varies
over $E(\mu^i)$, whereas, if $z$ is not a $p$-th residue, $y$ randomly
varies over the union of all $E(\mu^j)$s. Thus, in the first case,
the average is close to $\pi_i$, whereas, in the second case, it is
approximately $\frac{1}{p}\sum_j \pi_j$. It is easily seen that the difference
is bounded from below by $\epsilon = \frac{1}{Q}\left(1 - \frac{1}{p}\right)$. Using the law of large
numbers, this is enough to make the proper decision on the
$p$-th residuosity, with probability as close to 1 as we wish,
by using only polynomially large samples. This finishes the
proof.
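The gap between the two averages follows directly from the separation assumption on the $\pi_j$. One way to spell it out, using the notation of the sketch (with $\pi_i$ the value that exceeds every other $\pi_j$ by at least $1/Q$), is:

```latex
\pi_i - \frac{1}{p}\sum_{j=1}^{p}\pi_j
  \;=\; \frac{1}{p}\sum_{j \ne i}\left(\pi_i - \pi_j\right)
  \;\ge\; \frac{p-1}{p}\cdot\frac{1}{Q}
```

Since $p \ge 3$, this gap is at least $\frac{2}{3Q}$, which is inverse-polynomial in $|n|$ and can therefore be detected with polynomially many samples.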
Remarks.

1. Turning the previous sketch into a complete proof
involves a technical but rather long write-up: in particular, a
precise version of the law of large numbers has to be made
explicit, e.g. by using the Chebyshev inequality. Also, the
values of $\pi_i$ and $\frac{1}{p}\sum_j \pi_j$ are not known a priori and should
be approximated as well using the law of large numbers. We
urge the interested reader to consult [11] for similar proofs.

2. The higher residuosity oracle that was built in the proof
for the sake of contradiction uses inputs $\sigma$ and $g$ on top of
$n$, $y$ and $p$. Actually, one can check that everything goes
through, mutatis mutandis, if $\sigma$ is replaced by $\bar{\sigma} = \prod_{p \le B} p$.
Thus $\sigma$ is not really needed. As for $g$, as seen in section 2.1,
it can be chosen at random: a proper choice will be
spotted by sampling the corresponding oracle and checking its
correctness.






Twin Signatures: an Alternative 
to the Hash-and-Sign Paradigm 



David Naccache
Gemplus Card International
34, rue Guynemer
92447 Issy-les-Moulineaux, France

david.naccache@gemplus.com

David Pointcheval, Jacques Stern
École Normale Supérieure
45, rue d'Ulm
75230 Paris cedex 05, France

{david.pointcheval,jacques.stern}@ens.fr



ABSTRACT 

This paper introduces a simple alternative to the hash-and- 
sign paradigm, from the security point of view but for sign- 
ing short messages, called twinning. A twin signature is 
obtained by signing twice a short message by a signature 
scheme. Analysis of the concept in different settings yields 
the following results: 

• We prove that no generic algorithm can efficiently forge 
a twin DSA signature. Although generic algorithms 
offer a less stringent form of security than computa- 
tional reductions in the standard model, such success- 
ful proofs still produce positive evidence in favor of the 
correctness of the new paradigm. 

• We prove in the standard model an equivalence between
the hardness of producing existential forgeries (even
under adaptively chosen message attacks) of a twin
version of a signature scheme proposed by Gennaro,
Halevi and Rabin and the Flexible RSA Problem.

We consequently regard twinning as an interesting alter- 
native to hash functions for eradicating existential forgery 
in signature schemes. 

Keywords 

Digital Signatures, Provable Security, Discrete Logarithm, 
Generic Model, Flexible RSA Problem, Standard Model. 

1. INTRODUCTION 

The well-known hash and sign paradigm has two distinct 
goals: increasing performance by reducing the size of the 
signed message and improving security by preventing exis- 
tential forgeries. As a corollary, hashing remains mandatory 
even for short messages. 

From the conceptual standpoint, the use of hash functions
comes at the cost of extra assumptions such as the conjecture
that, for all practical purposes, concrete functions can
be identified with ideal black boxes [3] or that under certain
circumstances (black box groups [15, 21]) a new group
element must necessarily come from the addition of two already
known elements. In some settings [11] both models
are even used simultaneously.

This paper investigates a simple substitute to hashing that 
we call twinning. A twin signature is obtained by signing 
twice the same (short) raw message by a probabilistic sig- 
nature scheme, or two probabilistically related messages. 

We believe that this simple paradigm is powerful enough 
to eradicate existential forgery in a variety of contexts. To 
support this claim, we show that no generic algorithm can 
efficiently forge a twin DSA signature and prove that for a 
twin variant of a signature scheme proposed by Gennaro, 
Halevi and Rabin [8] (hereafter GHR) existential forgery, 
even under an adaptively chosen-message attack, is equiva- 
lent to the Flexible RSA Problem [5] in the standard model. 

2. DIGITAL SIGNATURE SCHEMES 

Let us begin with a quick review of definitions and security 
notions for digital signatures. Digital signature schemes are 
the electronic version of handwritten signatures for digital 
documents: a user's signature on a message m is a string 
which depends on m, on public and secret data specific to
the user and, possibly, on randomly chosen data, in such a
way that anyone can check the validity of the signature by
using public data only. The user's public data are called the
public key, whereas his secret data are called the secret key.
The intuitive security notion would be the impossibility of
forging a user's signatures without knowledge of his secret
key. In this section, we give a more precise definition of
signature schemes and of the possible attacks against them 
(most of those definitions are based on [9]). 

2.1 Definitions 

A signature scheme is defined by the three following algo- 
rithms: 

• The key generation algorithm $G$. On input $1^k$, where $k$
is the security parameter, the algorithm $G$ produces a
pair $(k_p, k_s)$ of matching public and secret keys.
Algorithm $G$ is probabilistic.

• The signing algorithm $\Sigma$. Given a message $m$ and a
pair of matching public and secret keys $(k_p, k_s)$, $\Sigma$
produces a signature $\sigma$. The signing algorithm might be
probabilistic.

• The verification algorithm $V$. Given a signature $\sigma$, a
message $m$ and a public key $k_p$, $V$ tests whether $\sigma$ is
a valid signature of $m$ with respect to $k_p$. In general,
the verification algorithm need not be probabilistic.

2.2 Forgeries and Attacks 

In this subsection, we formalize some security notions 
which capture the main practical situations. On the one 
hand, the goals of the adversary may be various: 

• Disclosing the secret key of the signer. It is the most 
serious attack. This attack is termed total break. 

• Constructing an efficient algorithm which is able to 
sign messages with good probability of success. This 
is called universal forgery. 

• Providing a new message-signature pair. This is called 
existential forgery. 

In many cases this latter forgery, the existential forgery, is 
not dangerous, because the output message is likely to be 
meaningless. Nevertheless, a signature scheme which is not 
existentially unforgeable (and thus admits existential
forgeries) does not guarantee by itself the identity of the 
signer. For example, it cannot be used to certify randomly 
looking elements, such as keys. Furthermore, it cannot for- 
mally guarantee the non-repudiation property, since anyone 
may be able to produce a message with a valid signature. 

On the other hand, various means can be made available 
to the adversary, helping her in her forgery. We focus on
two specific kinds of attacks against signature schemes: the 
no-message attacks and the known-message attacks. In the 
first scenario, the attacker only knows the public key of the 
signer. In the second one, the attacker has access to a list 
of valid message-signature pairs. According to the way this 
list was created, we usually distinguish many subclasses, but 
the strongest is the adaptively chosen-message attack, where 
the attacker can ask the signer to sign any message of her 
choice. She can therefore adapt her queries according to 
previous answers. 

When one designs a signature scheme, one wants to com- 
putationally rule out existential forgeries even under adap- 
tively chosen-message attacks, which is the strongest secu- 
rity level for a signature scheme. 

3. GENERIC ALGORITHMS 

Before we proceed, let us stress that although the generic
model in which we analyze DSA offers a somewhat weaker
form of security than the reductions that we apply to GHR
in the standard model, it still provides evidence that twinning
may indeed have a beneficial effect on security.

Generic algorithms [15, 21], as introduced by Nechaev and 
Shoup, encompass group algorithms that do not exploit any 
special property of the encodings of group elements other 
than the property that each group element is encoded by 
a unique string. Typically, algorithms like Pollard's $\rho$
algorithm [18] fall under the scope of this formalism while
index-calculus methods do not.

3.1 The Framework 

Recall that any finite Abelian group $\Gamma$ is isomorphic to a
product of cyclic groups of the form $(\mathbb{Z}_{p^k}, +)$, where $p$ is a
prime. Such groups will be called standard Abelian groups.
An encoding of a standard group $\Gamma$ is an injective map from
$\Gamma$ into a set of bit-strings $S$.

We give some examples: consider the multiplicative group
of invertible elements modulo some prime $q$. This group is
cyclic and isomorphic to the standard additive group
$\Gamma = \mathbb{Z}_{q-1}$. Given a generator $g$, an encoding $\sigma$ is obtained by
computing the binary representation $\sigma(x)$ of $g^x \bmod q$. The
same construction applies when one considers a multiplicative
subgroup of prime order $r$. Similarly, let $E$ be the group
of points of some non-singular elliptic curve over a finite field
$F$, then $E$ is either isomorphic to a (standard) cyclic group
$\Gamma$ or else is isomorphic to a product of two cyclic groups
$\mathbb{Z}_{d_1} \times \mathbb{Z}_{d_2}$. In the first case, given a generator $G$ of $E$, an
encoding is obtained by computing $\sigma(x) = x.G$, where $x.G$
denotes the scalar multiplication of $G$ by the integer $x$ and
providing coordinates for $\sigma(x)$. The same construction applies
when $E$ is replaced by one of its subgroups of prime
order $r$. Note that the encoding set appears much larger
than the group size, but compact encodings using only one
coordinate and a sign bit ±1 exist and for such encodings,
the image of $\sigma$ is included in the binary expansions of integers
$< tr$ for some small integer $t$, provided that $r$ is close
enough to the size of the underlying field $F$. This is exactly
what is recommended for cryptographic applications [10].

A generic algorithm $A$ over a standard Abelian group $\Gamma$
is a probabilistic algorithm that takes as input an encoding
list $\{\sigma(x_1), \ldots, \sigma(x_k)\}$, where each $x_i$ is in $\Gamma$. While
it executes, the algorithm may consult an oracle for further
encodings. Oracle calls consist of triples $\{i, j, \epsilon\}$, where $i$
and $j$ are indices of the encoding list and $\epsilon$ is $\pm$. The oracle
returns the string $\sigma(x_i \pm x_j)$, according to the value of $\epsilon$
and this bit-string is appended to the list, unless it was already
present. In other words, $A$ cannot access an element
of $\Gamma$ directly but only through its name $\sigma(x)$ and the oracle
provides names for the sum or difference of two elements
addressed by their respective names. Note however that $A$
may access the list at any time. In many cases, $A$ takes
as input a pair $\{\sigma(1), \sigma(x)\}$. Probabilities related to such
algorithms are computed with respect to the internal coin
tosses of $A$ as well as the random choices of $\sigma$ and $x$.
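
The oracle interface just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `GenericGroupOracle`, the 32-bit name length and the lazily sampled encoding are hypothetical choices, the group is taken to be the additive group of integers modulo a prime r, and the caller is assumed to pass x different from 1 and a name length with at least r distinct strings.

```python
import random


class GenericGroupOracle:
    """Toy oracle for the additive group Z_r behind a random encoding sigma.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self, r, x, bits=32, seed=0):
        self.r = r
        self._rng = random.Random(seed)
        self._bits = bits
        self._sigma = {}      # element -> bit-string, sampled lazily
        self._used = set()    # names already handed out (keeps sigma injective)
        self._elems = [1 % r, x % r]   # hidden elements behind the encoding list
        self.names = [self._encode(e) for e in self._elems]

    def _encode(self, elem):
        """Return sigma(elem), sampling a fresh unused name on first use."""
        if elem not in self._sigma:
            while True:
                name = format(self._rng.getrandbits(self._bits),
                              "0{}b".format(self._bits))
                if name not in self._used:
                    break
            self._used.add(name)
            self._sigma[elem] = name
        return self._sigma[elem]

    def query(self, i, j, sign):
        """Answer the query {i, j, sign}: the name of x_i + sign * x_j.

        The name is appended to the encoding list unless already present.
        """
        elem = (self._elems[i] + sign * self._elems[j]) % self.r
        name = self._encode(elem)
        if name not in self.names:
            self.names.append(name)
            self._elems.append(elem)
        return name


oracle = GenericGroupOracle(r=101, x=37)
two_x = oracle.query(1, 1, +1)                         # sigma(2x)
back = oracle.query(oracle.names.index(two_x), 1, -1)  # sigma(2x - x)
```

The algorithm only ever sees the strings in `oracle.names`; the hidden list `_elems` plays the role of the group elements it cannot access directly.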

The following theorem appears in [21]: 

Theorem 1. Let $\Gamma$ be a standard cyclic group of order
$N$ and let $p$ be the largest prime divisor of $N$. Let $A$ be a
generic algorithm over $\Gamma$ that makes at most $n$ queries to the
oracle. If $x \in \Gamma$ and an encoding $\sigma$ are chosen at random,
then the probability that $A$ returns $x$ on input $\{\sigma(1), \sigma(x)\}$
is $O(n^2/p)$.

Proof. We refer to [21] for a proof. However, we will 
need, as an ingredient for our own proofs, the probabilistic 
model used by Shoup. We develop the model in the special 
case where N is a prime number r, which is of interest to us. 
Alternatively, we could work in a subgroup of prime order 
r. 

Basically, we would like to identify the probabilistic space
consisting of $\sigma$ and $x$ with the space $S^{n+2} \times \Gamma$, where $S$ is the
set of bit-string encodings. Given a tuple $\{z_1, \ldots, z_{n+2}, y\}$
in this space, $z_1$ and $z_2$ are used as $\sigma(1)$ and $\sigma(x)$, the successive
$z_i$ are used in sequence to answer the oracle queries
and the unique value $y$ from $\Gamma$ serves as $x$. However, this
interpretation may yield inconsistencies as it does not take
care of possible collisions between oracle queries. To overcome
the difficulty, Shoup defines, along with the execution
of $A$, a sequence of linear polynomials $F_i(X)$, with coefficients
modulo $r$. Polynomials $F_1$ and $F_2$ are respectively set
to $F_1 = 1$ and $F_2 = X$ and the definition of polynomial $F_\ell$
is related to the $\ell$-th query $\{i, j, \epsilon\}$: $F_\ell = F_i \pm F_j$, where the
sign $\pm$ is chosen according to $\epsilon$. If $F_\ell$ is already listed as
a previous polynomial $F_h$, then $F_\ell$ is marked and $A$ is fed
with the answer of the oracle at the $h$-th query. Otherwise,
$z_\ell$ is returned by the oracle. Once $A$ has come to a stop, the
value of $x$ is set to $y$.

It is easy to check that the behavior of the algorithm which
plays with the polynomials $F_i$ is exactly similar to the behavior
of the regular algorithm, if we require that $y$ is not a root
of any polynomial $F_i - F_j$, where $i$, $j$ range over indices of
unmarked polynomials. A sequence $\{z_1, \ldots, z_{n+2}, y\}$ for which
this requirement is met is called a safe sequence. Shoup
shows that, for any $\{z_1, \ldots, z_{n+2}\}$, the set of $y$ such that
$\{z_1, \ldots, z_{n+2}, y\}$ is not safe has probability $O(n^2/r)$. From
a safe sequence, one can define $x$ as $y$ and $\sigma$ as any encoding
which satisfies $\sigma(F_i(y)) = z_i$, for all unmarked $F_i$. This
correspondence preserves probabilities. However, it does not
completely cover the sample space $\{\sigma, x\}$ since executions
such that $F_i(x) = F_j(x)$, for some indices $i$, $j$, such that $F_i$
and $F_j$ are not identical are omitted. To conclude the proof
of the above theorem in the special case where $N$ is a prime
number $r$, we simply note that the output of a computation
corresponding to a safe sequence $\{z_1, \ldots, z_{n+2}, y\}$ does
not depend on $y$. Hence it is equal to $y$ with only minute
probability. □

3.2 Digital Signatures over Generic Groups 

We now explain how generic algorithms can deal with attacks
against DSA-like signature schemes [6, 20, 16, 10].
We do this by defining a generic version of DSA that we
call GDSA. Parameters for the signature include a standard
cyclic group of prime order $r$ together with an encoding
$\sigma$. The signer also uses as a secret key/public key pair
$\{x, \sigma(x)\}$. Note that we have chosen to describe signature
generation as a regular rather than generic algorithm, using
a full description of $\sigma$. To sign a message $m$, $1 < m < r$, the
algorithm executes the following steps:

1. Generate a random number $u$, $1 < u < r$.

2. Compute $c \leftarrow \sigma(u) \bmod r$. If $c = 0$ go to step 1.

3. Compute $d \leftarrow u^{-1}(m + xc) \bmod r$. If $d = 0$ go to step 1.

4. Output the pair $\{c, d\}$ as the signature of $m$.

The verifier, on the other hand, is generic:

1. If $c \notin [1, r-1]$ or $d \notin [1, r-1]$, output invalid and stop.

2. Compute $h \leftarrow d^{-1} \bmod r$, $h_1 \leftarrow hm \bmod r$ and
$h_2 \leftarrow hc \bmod r$.

3. Obtain $\sigma(h_1 + h_2 x)$ from the oracle and compute
$c' \leftarrow \sigma(h_1 + h_2 x) \bmod r$.

4. If $c \neq c'$ output invalid and stop; otherwise output valid
and stop.

The reader may wonder how the verifier obtains the value
of $\sigma$ requested at step 3. This is simply achieved by mimicking
the usual double-and-add algorithm and asking the
appropriate queries to the oracle. This yields $\sigma(h_1)$ and
$\sigma(h_2 x)$. A final call to the oracle completes the task.
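
The signing steps and the generic verifier can be exercised on a toy instance. The following Python sketch is not the authors' code: the group order `R = 1019`, the expansion factor `T = 4`, the fixed seed and all function names are assumptions made for illustration, and the "oracle" is simulated with a lookup table that the verifier touches only through names.

```python
import random

R = 1019          # toy prime group order (assumption, far too small for real use)
T = 4             # names are integers below T * R
rng = random.Random(1)

# Random injective encoding sigma: Z_R -> integers below T * R.
SIGMA = dict(enumerate(rng.sample(range(T * R), R)))     # element -> name
DECODE = {name: elem for elem, name in SIGMA.items()}    # used by the oracle only


def oracle(name_a, name_b, sign=+1):
    """Group oracle: name of a + sign * b, given only the names of a and b."""
    return SIGMA[(DECODE[name_a] + sign * DECODE[name_b]) % R]


def scalar_mul(k, name):
    """Double-and-add through oracle calls only; requires k >= 1."""
    result, addend = None, name
    while k:
        if k & 1:
            result = addend if result is None else oracle(result, addend)
        addend = oracle(addend, addend)
        k >>= 1
    return result


def keygen():
    x = rng.randrange(2, R)
    return x, SIGMA[x]                 # secret key x, public key sigma(x)


def sign(m, x):
    """Regular (non-generic) signer: it uses the full description of sigma."""
    while True:
        u = rng.randrange(2, R)
        c = SIGMA[u] % R
        if c == 0:
            continue
        d = pow(u, -1, R) * (m + x * c) % R    # modular inverse needs Python 3.8+
        if d != 0:
            return c, d


def verify(m, c, d, name_one, name_x):
    """Generic verifier: it sees sigma(1) and sigma(x) but never x itself."""
    if not (1 <= c <= R - 1 and 1 <= d <= R - 1):
        return False
    h = pow(d, -1, R)
    h1, h2 = h * m % R, h * c % R
    combined = oracle(scalar_mul(h1, name_one), scalar_mul(h2, name_x))
    return combined % R == c


x, pk = keygen()
c, d = sign(123, x)
ok = verify(123, c, d, SIGMA[1], pk)
```

Verification succeeds because $h_1 + h_2 x = d^{-1}(m + cx) = u$, so the final oracle answer is $\sigma(u)$.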

A generic algorithm $A$ can also perform forgery attacks
against a signature scheme. This is defined by the ability
of $A$ to return on input $\{\sigma(1), \sigma(x)\}$ a triple $\{m, c, d\} \in \Gamma^3$
for which the verifier outputs valid. Here we assume that
both algorithms are performed at a stretch, keeping the same
encoding list.

To deal with adaptive attacks one endows $A$ with another
oracle, called the signing oracle. To query this oracle, the
algorithm provides an element $m \in \Gamma$. The signing oracle
returns a valid signature $\{c, d\}$ of $m$. Success of $A$ is defined
by its ability to produce a valid triple $\{m, c, d\}$, such that
$m$ has not been queried during the attack.

Such a forgery can be easily performed against this GDSA
scheme, even with just a passive attack: the adversary chooses
random numbers $h_1$ and $h_2$, $1 < h_1, h_2 < r$ and computes
$c \leftarrow \sigma(h_1 + h_2 x) \bmod r$. Then it defines $d = c h_2^{-1} \bmod r$,
$h = d^{-1} \bmod r$, and eventually $m = d h_1 \bmod r$. The triple
$\{m, c, d\} \in \Gamma^3$ is therefore a valid one, unless $c = 0$, which
is very unlikely.
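
The passive forgery can be checked numerically. The sketch below is a toy reconstruction, not code from the paper: the parameters and names are assumptions, the oracle answer $\sigma(h_1 + h_2 x)$ is modelled by a callback `name_of`, and for brevity the verifier reads the encoding table directly instead of going through oracle queries.

```python
import random

R = 1019                                   # toy prime group order (assumption)
rng = random.Random(7)
SIGMA = dict(enumerate(rng.sample(range(4 * R), R)))   # element -> name


def verify(m, c, d, x):
    """GDSA check written with direct access to sigma, for brevity only."""
    if not (1 <= c < R and 1 <= d < R):
        return False
    h = pow(d, -1, R)                      # modular inverse, Python 3.8+
    return SIGMA[(h * m + h * c * x) % R] % R == c


def forge(name_of):
    """Existential forgery without any signing query.

    name_of(h1, h2) stands for the oracle answer sigma(h1 + h2 * x); the
    forger never learns x.
    """
    while True:
        h1, h2 = rng.randrange(2, R), rng.randrange(2, R)
        c = name_of(h1, h2) % R
        if c != 0:
            d = c * pow(h2, -1, R) % R
            m = d * h1 % R
            return m, c, d


x = rng.randrange(2, R)                    # secret key, hidden inside the callback
m, c, d = forge(lambda h1, h2: SIGMA[(h1 + h2 * x) % R])
```

The forger has no control over the resulting `m`, which is why this is an existential rather than a universal forgery.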

4. THE SECURITY OF TWIN GDSA 
4.1 A Theoretical Result 

The above definitions extend to the case of twin signatures,
by requesting the attacker $A$ to output an $m$ and two
distinct pairs $\{c, d\} \in \Gamma^2$, $\{c', d'\} \in \Gamma^2$. Success is granted
as soon as the verifying algorithm outputs valid for both
triples¹. We prove the following:

Theorem 2. Let $\Gamma$ be a standard cyclic group of prime
order $r$. Let $S$ be a set of bit-string encodings of cardinality
at least $r$, included in the set of binary representations of
integers $< tr$, for some $t$. Let $A$ be a generic algorithm over
$\Gamma$ that makes at most $n$ queries to the oracle. If $x \in \Gamma$ and an
encoding $\sigma$ are chosen at random, then the probability that
$A$ returns a message $m$ together with two distinct GDSA
signatures of $m$ on input $\{\sigma(1), \sigma(x)\}$ is $O(tn^2/r)$.

Proof. We cover the non-adaptive case and tackle the
more general case after the proof. We use the probabilistic 
model developed in section 3.1. Let A be a generic attacker 
able to forge some m and two distinct signatures {c, d} and 
{c', d'}. We assume that, once these outputs have been pro- 
duced, A goes on checking both signatures; we estimate the 
probability that both are valid. 

We restrict our attention to behaviors of the full algorithm
corresponding to safe sequences $\{z_1, \ldots, z_{n+2}, y\}$. By this,
we discard a set of executions of probability $O(n^2/r)$. We
let $P$ be the polynomial $(md^{-1}) + (cd^{-1})X$ and $Q$ be the
polynomial $(md'^{-1}) + (c'd'^{-1})X$.

• We first consider the case where either $P$ or $Q$ does
not appear in the $F_i$ list before the signatures are produced.
If this happens for $P$, then $P$ is included in the
$F_i$ list at signature verification and the corresponding
answer of the oracle is a random number $z_i$. Unless
$z_i = c \bmod r$, which is true with probability at most
$t/r$, the signature is invalid. A similar bound holds for
$Q$.

¹ Using [14], the simultaneous square-and-multiply generation
or verification of two DSA signatures is only 17% slower than
the generation or verification of a single signature.

• We now assume that both $P$ and $Q$ appear in the $F_i$
list before $A$ outputs its signatures. We let $i$ denote the
first index such that $F_i = P$ and $j$ the first index such
that $F_j = Q$. Note that both $F_i$ and $F_j$ are unmarked
(as defined in section 3.1). If $i = j$, then we obtain
that $md^{-1} = md'^{-1}$ and $cd^{-1} = c'd'^{-1}$. From this, it
follows that $c = c'$, $d = d'$ and the signatures are not
distinct.

• We are left with the case where $i \neq j$. We let $\Omega_{i,j}$,
$i < j$, be the set of safe sequences producing two signatures
such that the polynomials $P$, $Q$, defined as
above, appear for the first time before the algorithm
outputs the signatures, as $F_i$ and $F_j$. We consider a
fixed value $w$ for $\{z_1, \ldots, z_{j-1}\}$ and let $\hat{w}$ be the set
of safe sequences extending $w$. We note that $F_i$ and
$F_j$ are defined from $w$ and we write $F_i = a + bX$,
$F_j = a' + b'X$. We claim that $\Omega_{i,j} \cap \hat{w}$ has probability
$\le t/r$. To show this, observe that one of the
signatures that the algorithm outputs is necessarily of
the form $\{c, d\}$, with $c = z_i \bmod r$, $c = db \bmod r$ and
$m = da \bmod r$. Now, the other signature is $\{c', d'\}$ and
since $m$ is already defined we get $d' = ma'^{-1} \bmod r$
and $c' = b'd' \bmod r$. This in turn defines $z_j \bmod r$
within a subset of at most $t$ elements. From this, the
required bound follows and, from the bound, we infer
that the probability of $\Omega_{i,j}$ is at most $t/r$.

Summing up, we have bounded the probability that a safe
sequence produces an execution of $A$ outputting two valid
signatures by $O(tn^2/r)$. This finishes the proof. □
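
The summation behind this bound can be spelled out as follows. This is a sketch rather than part of the original proof; it assumes that the $F_i$ list holds at most $n + 2$ polynomials, so that there are at most $(n+2)^2/2$ index pairs $i < j$, and it absorbs the two possible assignments of $P$ and $Q$ to such a pair into the constant.

$$
\Pr[\text{two valid signatures}]
\;\le\; \underbrace{O(n^2/r)}_{\text{unsafe sequences}}
\;+\; \underbrace{\frac{2t}{r}}_{P \text{ or } Q \text{ not listed}}
\;+\; \underbrace{\sum_{i<j} \frac{t}{r}}_{P,\,Q \text{ listed at } i \neq j}
\;\le\; O(n^2/r) + \frac{2t}{r} + \frac{(n+2)^2}{2}\cdot\frac{t}{r}
\;=\; O(tn^2/r).
$$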

In the proof, we considered the case of an attacker forg- 
ing a message-signature pair from scratch. A more elab- 
orate scenario corresponds to an attacker who can adap- 
tively request twin signatures corresponding to messages of 
his choice. In other words, the attacker interacts with the 
legitimate signer by submitting messages selected by its pro- 
gram. 

We show how to modify the security proof that was just
given to cover the adaptive case. We assume that each time
it requests a signature the attacker $A$ immediately verifies
the received signature. We also assume that the verification
algorithm is normalized in such a way that, when verifying
a signature $\{c, d\}$ of a message $m$, it asks for
$\sigma((md^{-1}) + (cd^{-1})x)$ after a fixed number of queries, say $q$. We now
explain how to simulate signature generation: as before,
we restrict our attention to behaviors of the algorithm
corresponding to safe sequences $\{z_1, \ldots, z_{n+2}, y\}$. When the
(twin) signature of $m$ is requested at a time of the computation
when the encoding list contains $i$ elements, one picks
$z_{i+q}$ and $z_{i+2q}$ and manufactures the two signatures as follows:

1. Let $c \leftarrow z_{i+q} \bmod r$, pick $d$ at random.

2. Let $c' \leftarrow z_{i+2q} \bmod r$, pick $d'$ at random.

3. Output $\{c, d\}$ and $\{c', d'\}$ as the first and second
signatures.



While verifying both signatures, $A$ will receive the elements
$z_{i+q}$ and $z_{i+2q}$, as

$\sigma((md^{-1}) + (cd^{-1})x)$ and $\sigma((md'^{-1}) + (c'd'^{-1})x)$

respectively, unless $F_{i+q}$ or $F_{i+2q}$ appears earlier in the $F_i$
list. Due to the randomness of $d$ and $d'$, this happens with
very small probability bounded by $n/r$. Altogether, the simulation
is spotted with probability $O(n^2/r)$ which does not
affect the $O(tn^2/r)$ bound for the probability of successful
forgery.

4.2 Practical Meaning of the Result 

We have shown that, in the setting of generic algorithms, 
existential forgery against twin GDSA has a minute success 
probability. Of course this says nothing about the security
of actual twin DSA. Still, we believe that our proof has
some practical meaning. The analogy with hash functions
and the random oracle model [3] is inspiring: researchers
and practitioners are aware that proofs in the random oracle
model are not proofs but a means to spot design flaws and
validate schemes that are supported by such proofs. Still,
all standard signature schemes that have been proposed use
specific functions which are not random by definition; our
proofs seem to indicate that if existential forgery against
twin DSA is possible, it will require exploiting structural
properties of the encoding function. This is of some help for
the design of actual schemes: for example, the twin DSA de- 
scribed in Appendix A allows signature with message recov- 
ery without hashing and without any form of redundancy, 
while keeping some form of provable security. This might be 
considered a more attractive approach than [17] or [1], the 
former being based on redundancy and the latter on random 
oracles. We believe that twin DSA is even more convincing 
in the setting of elliptic curves, where there are no known 
ways of taking any advantage of the encoding function. 

5. AN RSA-BASED TWINNING 
IN THE STANDARD MODEL 

The twin signature scheme described in this section be- 
longs to the (very) short list of efficient schemes provably se- 
cure in the standard model: in the sequel, we show that pro- 
ducing existential forgeries even under an adaptively chosen- 
message attack is equivalent to solving the Flexible RSA 
Problem [5]. 

Security in the standard model implies no ideal assump- 
tions; in other words we directly reduce the Flexible RSA 
Problem to a forgery. As a corollary, we present an efficient 
and provably secure signature scheme that does not require 
any hash function. 

Furthermore, the symmetry provided by twinning is much 
simpler to analyze than Cramer-Shoup's proposal [5] which 
achieves a similar security level, and similar efficiency, with 
a rather intricate proof. 

5.1 Gennaro-Halevi-Rabin Signatures 

In [8] Gennaro, Halevi and Rabin present the following
signature scheme: Let $n$ be an $\ell$-bit RSA modulus [19], $H$
a hash-function and $y \in \mathbb{Z}_n^*$. The pair $\{n, y\}$ is the signer's
public key, whose secret key is the factorization of $n$.

• To sign $m$, the signer hashes $e \leftarrow H(m)$ (which is very
likely to be co-prime with $\varphi(n)$) and computes the $e$-th
root of $y$ modulo $n$ using the factorization of $n$:

$s \leftarrow y^{1/e} \bmod n$

• To verify a given $\{m, s\}$, the verifier checks that

$s^{H(m)} \bmod n = y$.

Security relies on the Strong RSA Assumption. Indeed,
if $H$ outputs elements that contain at least a new prime
factor, existential forgery is impossible. Accordingly, Gennaro
et al. define a new property that $H$ must satisfy to
yield secure signatures: division intractability. Division
intractability means that it is computationally impossible to
find $a_1, \ldots, a_k$ and $b$ such that $H(b)$ divides the product of
all the $H(a_i)$. In [8], it is conjectured that such functions
exist and heuristic conversions from collision-resistant into
division-intractable functions are shown (see also [4]).

Still, security against adaptively chosen-message attacks
requires the hash function $H$ to either behave like a random
oracle or achieve the chameleon property [12]. This
latter property, for a hash function, provides a trapdoor
which helps to find second preimages, even with some fixed
part. Indeed, some signatures can be pre-computed, but
with specific exponents before outputting $y$:
$y = x^{\prod_i e_i} \bmod n$ for random primes $e_i = H(m_i, r_i)$.

Using the chameleon property, for the $i$-th query $m$ to
the signing oracle, the simulator who knows the trapdoor
can get an $r$ such that $H(m_i, r_i) = H(m, r) = e_i$. In the
random oracle model, one simply defines $H(m, r) \leftarrow e_i$.

Then $s = x^{\prod_{j \neq i} e_j} = y^{1/e_i} \bmod n$ and the signature
therefore consists of the triple $\{m, r, s\}$ satisfying

$s^{H(m,r)} = y \bmod n$.

Cramer and Shoup [5] also proposed a scheme based on 
the Strong RSA Assumption, the first practical signature 
scheme to be secure in the standard model, but with univer- 
sal one-way hash functions; our twin scheme will be similar 
but with a nice symmetry in the description (which helps 
for the security analysis) and no hash-functions, unless one 
wants to sign a long message. 

5.2 Preliminaries 

We build our scheme in two steps. The first scheme resists
existential forgeries when subjected to no-message attacks.
Twinning will immunize it against adaptively chosen-message
attacks.

5.2.1 Injective function into the prime integers. 
Before any description, we will assume the existence of
a function $p$ with the following properties: given a security
parameter $k$ (which will be the size of the signed messages),
$p$ maps any string from $\{0,1\}^k$ into the set of the prime
integers; $p$ is also designed to be easy to compute and injective.
A candidate is proposed and analyzed in Appendix B.

5.2.2 The Flexible RSA Problem and the Strong RSA 
Assumption. 

Let us also recall the Flexible RSA Problem [5]. Given an
RSA modulus $n$ and an element $y \in \mathbb{Z}_n^*$, find any exponent
$e > 1$, together with an element $x$ such that $x^e = y \bmod n$.

The Strong RSA Assumption is the conjecture that this
problem is intractable for large moduli. This was independently
introduced by [2, 7], and then used in many further
security analyses (e.g. [5, 8]).

5.3 A First GHR Variant 

The first scheme is very similar to GHR without random 
oracles but with function p instead: 

• To sign $m \in \{0,1\}^k$, the signer computes $e \leftarrow p(m)$
and the $e$-th root of $y$ modulo $n$ using the factorization
of $n$:

$s \leftarrow y^{1/e} \bmod n$

• To verify a given $\{m, s\}$, the verifier checks that

$s^{p(m)} \bmod n = y$.
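
A toy instance makes the two operations concrete. The Python sketch below is an illustration under loud assumptions: `prime_of` is a hypothetical injective map into the primes (it is NOT the candidate of Appendix B), the modulus is a tiny product of two safe primes chosen so that every exponent is co-prime with $\varphi(n)$, and none of the parameter sizes are meaningful for security.

```python
def is_prime(n):
    """Trial division; adequate for the toy sizes used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


K = 8      # message length in bits (toy value)
PAD = 8    # each message owns a private window of 2**PAD candidates


def prime_of(m):
    """Hypothetical injective map {0,1}^K -> primes (stand-in for p).

    Windows belonging to distinct messages are disjoint, hence injectivity.
    """
    assert 0 <= m < 2 ** K
    start = ((1 << K) | m) << PAD
    for cand in range(start | 1, start + (1 << PAD), 2):
        if is_prime(cand):
            return cand
    raise ValueError("no prime in window; enlarge PAD")


P, Q = 1019, 1187                 # toy safe primes: (P-1)/2 and (Q-1)/2 are prime
N = P * Q                         # public modulus n
PHI = (P - 1) * (Q - 1)           # secret; every odd prime above 593 is co-prime with it
Y = 2                             # public element y of Z_n^*


def sign(m):
    e = prime_of(m)
    return pow(Y, pow(e, -1, PHI), N)     # e-th root of y; needs Python 3.8+


def verify(m, s):
    return pow(s, prime_of(m), N) == Y


s = sign(0b10110011)
valid = verify(0b10110011, s)
```

Only `sign` uses `PHI`, i.e. the factorization of `N`; `verify` works from public data alone.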

Since p provides a new prime for each new message (in- 
jectivity), existential forgery contradicts the Strong RSA 
Assumption. However, how can we deal with adaptively 
chosen-message attacks without any control over the output 
of the function p, which is a publicly defined non-random 
oracle and not a trapdoor function either? 

5.4 The Twin Version 

The final scheme is quite simple since it consists in duplicating
the previous one: the signer uses two $\ell$-bit RSA
moduli $n_1$, $n_2$ and two elements $y_1$, $y_2$ in $\mathbb{Z}_{n_1}^*$ and $\mathbb{Z}_{n_2}^*$
respectively. Secret keys are the prime factors of the $n_i$.

• To sign a message $m$, the signer probabilistically derives
two messages $\mu_1, \mu_2 \in \{0,1\}^k$ (from $m$ and a
random tape $\omega$), computes $e_i \leftarrow p(\mu_i)$ and then the
$e_i$-th root of $y_i$ modulo $n_i$, for $i = 1, 2$, using the
factorization of the moduli:

$\{s_1 \leftarrow y_1^{1/e_1} \bmod n_1,\ s_2 \leftarrow y_2^{1/e_2} \bmod n_2\}$

• To verify a given $\{m, \omega, s_1, s_2\}$, the verifier computes
$\mu_1$ and $\mu_2$, then checks that $s_i^{p(\mu_i)} \bmod n_i = y_i$, for
$i = 1, 2$.

To prevent forgeries, a new message must involve a new
exponent, either $e_1$ or $e_2$, which never occurred in the signatures
provided by the signing oracle. Therefore, a first
requirement is that $\mu_1$ and $\mu_2$ define at most one message
$m$, but only if they have been correctly constructed. Thus,
some redundancy is furthermore required.

We thus suggest the following derivation, to get $\mu_1$ and
$\mu_2$ from $m \in \{0,1\}^{k/2}$ (we assume $k$ to be even): one
chooses two random elements $a, b \in \{0,1\}^{k/2}$, then
$\mu_1 = (m \oplus a) \| (m \oplus b)$ and $\mu_2 = a \| b$.

Clearly, given $\mu_1$ and $\mu_2$, one gets back $M = \mu_1 \oplus \mu_2$,
which provides a valid message if and only if the redundancy
holds: $\overline{M} = \underline{M}$, where $\overline{S}$ and $\underline{S}$ denote the two $k/2$-bit
halves of a $k$-bit string $S$, the most significant and the least
significant parts respectively.
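
The derivation and the redundancy check amount to a few bit operations. The sketch below assumes $k = 16$ and represents bit-strings as Python integers; the function names `derive` and `recover` are illustrative and do not come from the paper.

```python
import secrets

K = 16                        # size of mu_1 and mu_2 in bits (assumed even)
HALF = K // 2
MASK = (1 << HALF) - 1


def derive(m, a=None, b=None):
    """Split a K/2-bit message m into (mu1, mu2) using random halves a and b."""
    assert 0 <= m <= MASK
    a = secrets.randbits(HALF) if a is None else a
    b = secrets.randbits(HALF) if b is None else b
    mu1 = ((m ^ a) << HALF) | (m ^ b)      # (m xor a) || (m xor b)
    mu2 = (a << HALF) | b                  # a || b
    return mu1, mu2


def recover(mu1, mu2):
    """Return m when the redundancy check passes, otherwise None."""
    big_m = mu1 ^ mu2                      # equals m || m for honest shares
    high, low = big_m >> HALF, big_m & MASK
    return high if high == low else None


mu1, mu2 = derive(0xAB)
message = recover(mu1, mu2)                # 0xAB again
tampered = recover(mu1 ^ 1, mu2)           # one flipped bit breaks the redundancy
```

Taken alone, each of `mu1` and `mu2` is uniformly distributed, which is the secret-sharing property the simulator relies on later.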

5.5 Existential Forgeries 

Let us show that existential forgery of the twin scheme,
with the above derivation process, leads to a new solution of the
Flexible RSA Problem:

Lemma 1. After $q$ queries to the signing oracle, the probability
that there exist a new message $m$ and values $a$, $b$,
which lead to $\mu_1 = (m \oplus a) \| (m \oplus b)$ and $\mu_2 = a \| b$, such
that both $e_1 = p(\mu_1)$ and $e_2 = p(\mu_2)$ already occurred in the
signatures provided by the signing oracle is less than $q^2/2^{k/2}$.

Proof. Let $\{m_i, a_i, b_i, s_{1,i}, s_{2,i}\}$ denote the answers of
the signing oracle. Using the injectivity of $p$, the existence
of such $m$, $a$ and $b$ means that there exist indices $i$ and $j$ for
which

$(m \oplus a) \| (m \oplus b) = \mu_1 = \mu_{1,i} = (m_i \oplus a_i) \| (m_i \oplus b_i)$
$a \| b = \mu_2 = \mu_{2,j} = a_j \| b_j$.

Then

$a \oplus b = (m \oplus a) \oplus (m \oplus b) = (m_i \oplus a_i) \oplus (m_i \oplus b_i) = a_i \oplus b_i$,

and

$a \oplus b = a_j \oplus b_j$.

Note that $i \neq j$: otherwise $a = a_i$ and $b = b_i$, hence
$m = m_i$, and $m$ would not be new. Therefore, for a $j > i$
(the case $i > j$ is similar), the new random elements $a_j$, $b_j$
must satisfy $a_j \oplus b_j = a_i \oplus b_i$. Since $a_j$ and $b_j$
are randomly chosen by the signer, the probability that this
occurs for some $i < j$ is less than $(j - 1)/2^{k/2}$.

Altogether, the probability that for some $j$ there exists
some $i < j$ which satisfies the above equality is less than
$q^2/2 \times 2^{-k/2}$. By symmetry, we obtain the same result if we
exchange $i$ and $j$.

The probability that both exponents already appeared is
consequently smaller than $q^2/2^{k/2}$. □

To prevent adaptively chosen-message attacks, we need 
no trapdoor property for p, nor random oracle assumption 
either. We simply give the factorization of one modulus to 
the simulator, which can use any pre-computed exponenti- 
ation with any new message, as when chameleon functions 
are used [8]. 

5.6 Adaptively Chosen-Message Attacks 

Indeed, to prevent adaptively chosen-message attacks, one 
just needs to describe a simulator; our simulator works as 
follows: 

• The simulator is first given the moduli $n_1$, $n_2$ and the
elements $y_1 \in \mathbb{Z}_{n_1}^*$, $y_2 \in \mathbb{Z}_{n_2}^*$, as well as the factorization
of $n_\gamma$, where $\gamma$ is randomly chosen in $\{1, 2\}$.
To simplify notations we assume that $\gamma = 1$. And
the following works without loss of generality since the
derivation of $\mu_1$ and $\mu_2$ is perfectly symmetric: they
are randomly distributed, but satisfy $\mu_1 \oplus \mu_2 = m \| m$
(it is a perfect secret sharing).

• The simulator randomly generates $q$ values
$e_{2,j} \leftarrow p(\mu_{2,j})$, with randomly chosen $\mu_{2,j} \in_R \{0,1\}^k$ for
$j = 1, \ldots, q$ and computes

$z \leftarrow y_2^{\prod_{j=1,\ldots,q} e_{2,j}} \bmod n_2$.

The new public key for the signature scheme is the
following: the moduli $n_1$, $n_2$ with the elements $y_1$, $z$ in
$\mathbb{Z}_{n_1}^*$ and $\mathbb{Z}_{n_2}^*$ respectively.

• For the $j$-th signed message $m$, the simulator first gets
$(a \| b) \leftarrow (m \| m) \oplus \mu_{2,j}$. It therefore computes
$\mu_1 \leftarrow a \| b$, and thus $\mu_2 \leftarrow \mu_{2,j} = (m \oplus a) \| (m \oplus b)$.

Then, it knows $s_2 = y_2^{\prod_{i \neq j} e_{2,i}} \bmod n_2$, and computes
$s_1$ using the factorization of $n_1$.



Such a simulator can simulate up to q signatures, which 
leads to the following theorem. 

Theorem 3. Let us consider an adversary against the
twin-GHR scheme who succeeds in producing an existential
forgery, with probability greater than $\varepsilon$, after $q$ adaptive
queries to the signing oracle in time $t$, then the Flexible RSA
Problem can be solved with probability greater than $\varepsilon'$ within
a time bound $t'$, where

$\varepsilon' = \frac{1}{2}\left(\varepsilon - \frac{q^2}{2^{k/2}}\right)$ and $t' = t + O(q \times \ell^2 \times k)$.

Proof. Note that the above bounds are almost optimal
since $\varepsilon' \approx \varepsilon/2$ and $t' \approx 2t$. Indeed, the time needed to
produce an existential forgery after $q$ signature queries is
already in $O(q \times (|n_1|^2 + |n_2|^2)k)$. To evaluate the success
probability, q is less than say 2 , but k may be taken greater 
than 160 bits (and even much more). 

To conclude the proof, one just needs to address the random
choice of $\gamma$. As we have seen in Lemma 1, with probability
greater than $\varepsilon - q^2/2^{k/2}$, one of the exponents in the
forgery never appeared before. Since $\gamma$ is randomly chosen
and the view of the simulation is perfectly independent of
this choice, with probability of one half, $e = e_2$ is new. Let
us follow our assumption that $\gamma = 1$, then

$s^e = s_2^{e_2} = z = y_2^{\pi} \bmod n_2$,

where $\pi = \prod_{i=1,\ldots,q} e_{2,i}$. Since e is new, it is relatively
prime with $\pi$, and therefore, there exist u and v such that
$ue + v\pi = 1$: let us define $x = y_2^u s^v \bmod n_2$,

$x^e = (y_2^u s^v)^e = y_2^{ue} s^{ev} = y_2^{ue} (y_2^{\pi})^v = y_2 \bmod n_2$.

We thus obtain an e-th root of the given $y_2$ modulo $n_2$, for
a new prime e. □
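
The root-extraction step that closes the proof can be checked numerically. The sketch below is only an illustration under toy assumptions: the modulus 61 × 53, the base y = 2, the signed exponents 7 and 11 and the new exponent 17 are hypothetical values, and the "forgery" s is produced with the factorization purely to have something to test against.

```python
def egcd(a, b):
    """Return (g, u, v) with u*a + v*b == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, u, v = egcd(b, a % b)
    return g, v, u - (a // b) * v

def extract_root(y, s, e, pi, n):
    """Given s^e == y^pi (mod n) and gcd(e, pi) == 1, return x with x^e == y."""
    g, u, v = egcd(e, pi)
    assert g == 1
    # pow() with a negative exponent (Python 3.8+) inverts modulo n first
    return pow(y, u, n) * pow(s, v, n) % n

# Toy (insecure) parameters, assumed for illustration only
n, phi = 61 * 53, 60 * 52
y, pi, e = 2, 7 * 11, 17
z = pow(y, pi, n)                  # simulated public element z = y^pi mod n
s = pow(z, pow(e, -1, phi), n)     # stand-in forgery: s^e == z (mod n)
x = extract_root(y, s, e, pi, n)   # e-th root of y modulo n
```

Since ue + vπ = 1, one of u, v is negative, which is why the sketch relies on modular inverses of y and s.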

5.7 More Signatures 

One may remark that the length of the messages we can
sign with the above construction is limited to k/2 bits, because
of the required redundancy. But one can increase the size by
signing three derived messages: in order to sign $m \in \{0,1\}^k$,
one chooses two random elements $a, b \in \{0,1\}^{k/2}$ (we still
assume k to be even), and signs with different moduli

$\mu_1 = m \oplus (a\|b)$
$\mu_2 = a\|b$
$\mu_3 = m \oplus (b\|a)$.
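
As a rough illustration of the three derived messages, the following sketch builds them for a k-bit integer message and shows one way a verifier could check their mutual consistency. The function names and the consistency check are assumptions made for this example; the text above only specifies the three values.

```python
import secrets

def derive(m: int, k: int):
    """Return (mu1, mu2, mu3) for a k-bit message m, with k even."""
    half = k // 2
    a, b = secrets.randbits(half), secrets.randbits(half)
    ab = (a << half) | b          # a || b
    ba = (b << half) | a          # b || a
    return m ^ ab, ab, m ^ ba

def recover(mu1: int, mu2: int, mu3: int, k: int):
    """Return m if the three values are mutually consistent, else None."""
    half = k // 2
    a, b = mu2 >> half, mu2 & ((1 << half) - 1)
    m = mu1 ^ mu2                 # undo the mask a || b
    return m if mu3 == m ^ ((b << half) | a) else None
```

Each of the three values would then be signed under its own modulus; the signing itself is not shown here.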

6. CONCLUSION AND FURTHER RESEARCH

We proposed an alternative to the well-known hash-and- 
sign paradigm, based on the simple idea of signing twice 
(or more) identical or related short messages. We believe 
that our first investigations show that this is a promising 
strategy, deserving further study. 

A number of interesting questions remain open. First,
from the efficiency point of view, which is a frequent concern,
we are aware that the current proposals do not deal with
either the computational cost or the communication load in
an efficient way. Thus, for example, can the number of fields
in a twin DSA be reduced from four ({c, d} and {c', d'}) to
three or less? Can we also suppress some fields in the
twin-GHR, or sign k-bit long messages with only two signatures?
Finally, can an increase in the number of signatures (e.g.
three instead of two) yield better security bounds?

APPENDIX

A. TWIN SIGNATURES WITH MESSAGE RECOVERY

In this appendix, we describe a twin version of the
Nyberg-Rueppel scheme [17] which provides message recovery.
Keeping the notations of section 4.1:

1. Generate a random number u, 1 < u < r.

2. Compute $c \leftarrow \sigma(u) + m \bmod r$. If c = 0 go to step 1.

3. Compute an integer $d \leftarrow u - cx \bmod r$.

4. Output the pair {c, d} as the signature.

In the above, m is what is called in [10] a message with
appendix. It simply means that it has an adequate redun- 
dancy. The corresponding verification is performed by the 
following (generic) steps: 

1. If $c \notin [1, r-1]$ or $d \notin [0, r-1]$, output invalid and stop.

2. Obtain $\sigma(d + cx)$ from the oracle and compute
$\gamma \leftarrow \sigma(d + cx) \bmod r$.

3. Check the redundancy of $m \leftarrow c - \gamma \bmod r$. If
incorrect output invalid and stop; otherwise output the
reconstructed message m, output valid and stop.

In the twin setting, signature generation is alike but is per- 
formed twice, so as to output two distinct signatures. How- 
ever, no redundancy is needed. The verifier simply checks 
that the signatures are distinct and outputs two successive 
versions of the message, say m and m'. It returns valid 
if m == m' and invalid otherwise. The security proof is 
sketched here, we leave the discussion of adaptive attacks 
to the reader. 
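
The twin variant can be sketched concretely by instantiating the generic encoding with modular exponentiation, so that σ(u) becomes g^u mod p and σ(d + cx) becomes g^d · y^c mod p with y = g^x. This instantiation, the toy parameters (p = 23, r = 11, g = 2) and the function names are assumptions made for illustration; the analysis in this appendix is stated in the generic model only.

```python
import secrets

P, R, G = 23, 11, 2          # toy, insecure: G has order R in Z_P^*

def keygen():
    x = secrets.randbelow(R - 1) + 1
    return x, pow(G, x, P)

def sign(m, x):
    """One Nyberg-Rueppel signature {c, d} on a message m in [0, R-1]."""
    while True:
        u = secrets.randbelow(R - 1) + 1
        c = (pow(G, u, P) + m) % R
        if c != 0:
            return c, (u - c * x) % R

def recover(sig, y):
    """Message recovery: m = c - (g^d * y^c mod p) mod r, or None if out of range."""
    c, d = sig
    if not (1 <= c <= R - 1 and 0 <= d <= R - 1):
        return None
    gamma = pow(G, d, P) * pow(y, c, P) % P % R
    return (c - gamma) % R

def twin_sign(m, x):
    """Sign twice, insisting on two distinct signatures."""
    s1 = sign(m, x)
    while True:
        s2 = sign(m, x)
        if s2 != s1:
            return s1, s2

def twin_verify(s1, s2, y):
    """Valid iff the signatures are distinct and recover the same message."""
    if s1 == s2:
        return None
    m1, m2 = recover(s1, y), recover(s2, y)
    return m1 if m1 is not None and m1 == m2 else None
```

Note that no redundancy check appears in `recover`: in the twin setting, agreement of the two recovered messages plays that role.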

We keep the notations and assumptions of section 4 and
let A be a generic attacker over $\Gamma$ which outputs, on input
$\{\sigma(1), \sigma(x)\}$, two signature pairs {c, d}, {c', d'} and runs the
verifying algorithm that produces from these signatures two
messages m, m' and checks whether they are equal. We
wish to show that, if $x \in \Gamma$ and an encoding $\sigma$ are chosen at
random, then the probability that m = m' is $O(tn^2/r)$.

As before, we restrict our attention to behaviors of the full
algorithm corresponding to safe sequences $\{z_1, \ldots, z_n, y\}$.
We let P, Q be the polynomials d + cX and d' + c'X. We
first consider the case where either P or Q does not appear
in the $F_i$ list before the signatures are produced. If this
happens for P, then P is included in the $F_i$ list at signature
verification and the corresponding answer of the oracle is a
random number $z_i$. Since m is computed as $c - z_i \bmod r$,
the probability that m = m' is bounded by t/r. A similar
bound holds for Q.

We now assume that both P and Q appear in the $F_i$
list before A outputs its signatures. We let i denote the
first index such that $F_i = P$ and j the first index such
that $F_j = Q$. Note that both $F_i$ and $F_j$ are unmarked (as
defined in section 3.1). If i = j, then we obtain that c = c'
and d = d'. From this, it follows that the signatures are not
distinct.

As in section 4, we are left with the case where $i \neq j$
and we define $\Omega_{i,j}$, i < j, to be the set of safe sequences
producing two signatures such that the polynomials P, Q
defined as above appear for the first time before the
algorithm outputs the signatures, as $F_i$ and $F_j$. We show
that, for any fixed value $w = \{z_1, \ldots, z_{j-1}\}$, $\Omega_{i,j} \cap \hat{w}$ has
probability $\leq t/r$, where $\hat{w}$ is defined as above. Since we
have $m = c - z_i \bmod r$ and $m' = c' - z_j \bmod r$, we obtain
$z_j = c' - c + z_i \bmod r$, from which the upper bound follows.
From this bound, we obtain that the probability of $\Omega_{i,j}$ is at
most t/r and, taking the union of the various $\Omega_{i,j}$s, we
conclude that the probability to obtain a valid twin signature
is at most $O(tn^2/r)$.

B. THE CHOICE OF FUNCTION p
B.1 A Candidate

The following is a natural candidate:

$p : \{0,1\}^k \rightarrow \mathcal{P}$

$m \mapsto \mathrm{nextprime}(m \times 2^{\tau})$

where $\tau$ is suitably chosen to guarantee the existence of a
prime in any set $[m \times 2^{\tau}, (m+1) \times 2^{\tau}[$, for $m < 2^k$.

Note that the deterministic property of nextprime is not 
mandatory, one just needs it to be injective. But then, the 
preimage must be easily recoverable from the prime: the 
exponent is sent as the signature, from which one checks 
the primality and extracts the message (message-recovery). 
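
A minimal sketch of this candidate, with message recovery by dropping the low-order bits, is given below. The window size TAU = 20 and the 16-bit message range are toy values assumed for the example, and the primality test is a Miller-Rabin variant with fixed bases, which is deterministic for the small inputs used here; none of these choices come from the text.

```python
TAU = 20  # toy window size (assumption), suitable for 16-bit messages

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime(n):
    """Miller-Rabin with fixed bases; deterministic for n below 2**64."""
    if n < 2:
        return False
    for q in SMALL_PRIMES:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in SMALL_PRIMES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def next_prime_map(m, tau=TAU):
    """p(m): the first prime at or after m * 2^tau, required to stay in m's window."""
    e = m << tau
    while not is_prime(e):
        e += 1
    assert e < (m + 1) << tau, "no prime in the window"
    return e

def recover_message(e, tau=TAU):
    """Check primality, then extract m from the high-order bits of the exponent."""
    return e >> tau if is_prime(e) else None
```

Because every window [m × 2^τ, (m+1) × 2^τ[ is disjoint from the others, the map is injective as soon as each window contains a prime.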

B.2 Analysis 

It is clear that any generator of random primes, using m as 
a seed, can be considered as a candidate for p. The function 
proposed above is derived from a technique for accelerating 
prime generation called incremental search (e.g. [13], page 
148). 

1. Input: an odd k-bit number $n_0$ (derived from m)

2. Test the s numbers $n_0, n_0 + 2, \ldots, n_0 + 2(s - 1)$ for
primality

Under reasonable number-theoretic assumptions, if $s = c \cdot \ln 2^k$,
the probability of failure of this technique is smaller
than $2e^{-2c}$, for large k.

Using our notations, in such a way that there exists at
least a prime in any set $[m \times 2^{\tau}, (m + 1) \times 2^{\tau}[$, except with
probability smaller than $2^{-80}$, we obtain from the above
formulae that $c = 40$, and $2^{\tau} > 40 \ln 2^{k+\tau+1}$. Therefore, a suitable
candidate is $\tau = 5 \log_2 k$, and less than 20k primality tests
have to be performed.

B.3 Extensions 

B.3.1 Collision-resistance: 

To sign large messages (at the cost of extra assumptions), 
one can of course use any collision-resistant hash-function h 
before signing (using the classical hash-and-sign technique). 
Clearly, the new function $m \mapsto p(h(m))$ is not mathematically
injective, but just computationally injective (which is
equivalent to collision-resistance), which is enough for the
proof.

B.3.2 Division intractability:

If one wants to improve efficiency, using the
division-intractability conjecture proposed in [8], any function that
outputs k-bit strings can be used instead of p. More
precisely:

Definition (Division Intractability). A function H is
said to be $(n, \nu, \tau)$-division intractable if any adversary which runs
in time $\tau$ cannot find, with probability greater than $\nu$, a set
of elements $a_1, \ldots, a_n$ and b such that H(b) divides the
product of all the $H(a_i)$.
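
The winning condition in this definition is easy to state as a predicate. The sketch below is only an illustration: the function name is hypothetical, and the requirement that b differ from every a_i is an assumption added here to exclude the trivial solution, since the definition as printed does not spell it out.

```python
from math import prod

def breaks_division_intractability(H, a_list, b):
    """True if H(b) divides the product of H(a) over a_list, with b not in a_list."""
    if b in a_list:          # assumed side condition: b must be a fresh element
        return False
    return prod(H(a) for a in a_list) % H(b) == 0
```

For instance, with the toy map H(x) = 2x + 1, the elements a = [1, 2] give H-values 3 and 5, and b = 7 gives H(b) = 15, which divides their product.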

As above, that function p would not be injective, but just 
collision-resistant, which is enough to prove the following: 

Theorem 4. Let us consider the twin-GHR scheme where
p is any $(q, \varepsilon, t)$-division-intractable hash function. Let us
assume that an adversary A succeeds in producing an
existential forgery under an adaptively chosen-message attack
within time t and with probability greater than $\varepsilon$, after q
queries to the signing oracle. Then one can either
contradict the division-intractability assumption or solve the
Flexible RSA Problem with probability greater than $\varepsilon'$ within a
time bound t', where

Abstract : The signature generation phase of most
DLP-based signature schemes (for instance
Schnorr [10], El-Gamal [4] or the newly
standardized D.S.A. [3]) includes the
time-consuming computation of $r = g^k \bmod p$ where k
is random.

This paper introduces a new computational 
strategy that can apply in this particular context : 

A batch exponentiation technique which allows 
the generation of large sets of exponentials 
without introducing any bias between the ks (that 
is, the signer can batch-compute the exponentials 
corresponding to arbitrarily imposed powers -for 
instance by an external random number 
generator). Our method offers real improvements 
over the prior art with various time and memory 
trade-offs. 



1. Introduction 

In many DLP-based signature schemes 1 the signer performs the 
operation $r = g^k \bmod p$ where k is random. As the signer is often
the " weak party " in the signature protocol, several authors tried 
to accelerate the exponentiation by pre-computing values [11, [5] 
or sub-contracting a part of the exponentiation workload to the 
verifier [6] (provided that a set of precautions is taken into 
consideration). Apart from the fact that some of these algorithms were
broken [7], [8], the extra memory storage they need is frequently an
unrealistic assumption.

In this paper, we investigate a strategy for improving the
generation of r : the method (providing improvements ranging
from 42% to 85% over the square-&-multiply algorithm) can
apply to the batch generation of fixed-g-based signatures without
introducing any bias into the exponents (that is, we assume that
the ks are imposed on the signer by some random source). We
assume that no pre-computation is allowed other than what is
needed to execute a basic square-&-multiply of similar size. The new
method may also open the way to interesting developments for
accelerating the computation of discrete logarithms.

Permission to make digital/hard copies of all or part of this material for
personal or classroom use is granted without fee provided that the copies
are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication and its date appear, and notice is
given that copyright is by permission of the ACM, Inc. To copy otherwise,
to republish, to post on servers or to redistribute to lists, requires specific
permission and/or fee.
CCS '96, New Delhi, India
© 1996 ACM 0-89791-829-0/96/03..$3.50



2. Unbiased Batch-Exponentiation 

The simultaneous signature of many messages by a shop terminal 
or the processing of electronic documents by an administration 
frequently involves massive computations consisting in the 
repetition of small operations. The re-combination of these 
computations for minimizing the computational effort is thus an 
interesting research direction (see for instance [10]). 
The batch exponentiation technique proposed hereafter is built
around the following observation : since the exponents are
random, the uniform distribution of ones in different exponents is
expected to have some matching patterns. Considering this fact,
our batch-exponentiation strategy consists in minimizing the
signer's workload by exponentiating the intersection separately
and resetting the corresponding bits in the initial exponents.

2.1 Parallel square-&-multiply (straightforward)

Let n be the size of the exponents and N the number of
exponentials to be computed. The usual method to generate the rs
consists in calculating successive squares of g and performing the
required multiplications selected by the bits of each exponent (about n/2
multiplications).

Let $S = \{k_i \mid i = 1, \ldots, N\}$ be a set of random powers. The

corresponding set of exponentials

$R = \{r_i \mid i = 1, \ldots, N\}$

can be computed by performing the successive squares of g only
once. Thus, the total computational effort mainly relies on the
average number of multiplications, depending on the Hamming
weight of each random exponent.

The algorithm PSM(S, R), of complexity

$E(N) = N(n/2 - 1) + (n - 1)$

using N + 1 registers, is :

for i <- 1 to N : r_i <- 1
for j <- 0 to n-1
    for i <- 1 to N : if k_i[j] = 1 then r_i <- r_i * g mod p
    g <- g * g mod p

where $k_i = \sum_{j=0}^{n-1} k_i[j] \, 2^j$.
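
The PSM loop translates almost line for line into code. The following sketch assumes integer exponents of at most n bits; the function name is chosen for this example.

```python
def psm(g, exponents, n, p):
    """Parallel square-&-multiply: g^k mod p for every k, sharing the squarings."""
    results = [1] * len(exponents)           # one register r_i per exponent
    for j in range(n):                       # bit j of every exponent
        for i, k in enumerate(exponents):
            if (k >> j) & 1:
                results[i] = results[i] * g % p
        g = g * g % p                        # one squaring serves all exponents
    return results
```

The N result registers plus the running square of g account for the N + 1 registers mentioned above.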

2.2 The Basic Strategy

Denoting S a set of N random powers $k_i$, the strategy to generate
the related exponentials is the following :

(1) Cut S into $T = \lceil N/L \rceil$ sets $s_h = \{k_j\}_{j \leq L}$ where $L \leq 5$ (see footnote 2)

(2) Let $P(s_h) = \bigcup_{i=1}^{2^L-1} s_{h,i}$ where the $s_{h,i}$ are $s_h$'s subsets with $s_{h,i} \neq \emptyset$

(3) For h <- 1 to T

(4) For i <- 1 to $2^L - 1$ : $c_{h,i} = \bigotimes k_j$

(5) For C <- L-1 to 1

    For i <- 1 to $2^L - 1$

        if Card($s_{h,i}$) = C then

(6) For i <- 1 to L



1 for instance [3], [4] and [10]

2 L value depends on the exponent length; further explanation will
be found in 3.2

The idea is to operate on each subset, resetting the common bits of
the powers, and compute the exponentials together to save both
squarings and multiplications. The following section
describes the method for the reduced case of a two-power set.

2.3 Exponent Combination Method 

Hereafter, we only consider the number of multiplications
required to compute the $r_i$s, assuming that it is possible to
calculate the squares only once by the previously described
parallel method.

Denoting

$a = \sum_i a_i 2^i$ and $b = \sum_i b_i 2^i$

and assuming that $g^a \bmod p$ and $g^b \bmod p$ are to be computed,

let $c = \sum_i a_i b_i 2^i$.

If a and b are randomly chosen, one should expect that :

$\sum_i a_i \cong \frac{n}{2}$ and $\sum_i a_i b_i \cong \frac{n}{4}$.

Given the fact that :

$w(a - c) + w(b - c) + w(c) \leq w(a) + w(b)$

(where w denotes the Hamming weight), our strategy consists in
computing :

$G_a = g^{a \oplus c} \bmod p$
$G_b = g^{b \oplus c} \bmod p$
$G_c = g^{c} \bmod p$

to obtain

$r_a = G_a G_c \bmod p = g^a \bmod p$
$r_b = G_b G_c \bmod p = g^b \bmod p$

The gain for a set of N signatures is therefore statistically N(n/4 -
1) multiplications, which tends to 25% of the total multiplications
required to generate a set of N signatures with the parallel
square-&-multiply, if we simply apply this strategy to all signatures
grouped by pairs as illustrated in the following table :



operations    PSM            Exponent Combination
r_a           ≅ n/2 - 1      ≅ n/2 - n/4 - 1
r_b           ≅ n/2 - 1      ≅ n/2 - n/4 - 1
g^c           none           ≅ n/4 - 1
Total         ≅ n - 2        ≅ 3n/4 - 3

Table 1 : PSM and batch exponentiation performance
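
The pairwise combination can be sketched as follows, with c taken as the bitwise AND of the two exponents. The function name is an assumption, and the multiplication counter here also counts the first multiplication into each empty accumulator, which the table above leaves out (hence its "- 1" terms).

```python
def combined_pair(g, a, b, n, p):
    """Return (g^a mod p, g^b mod p, multiplications) using the common part c = a AND b."""
    c = a & b
    parts = [a ^ c, b ^ c, c]      # disjoint bit sets: a = (a^c) + c, b = (b^c) + c
    acc = [1, 1, 1]                # accumulators for G_a, G_b, G_c
    mults = 0
    for j in range(n):
        for t, e in enumerate(parts):
            if (e >> j) & 1:
                acc[t] = acc[t] * g % p
                mults += 1
        g = g * g % p              # shared squaring
    r_a = acc[0] * acc[2] % p      # G_a * G_c
    r_b = acc[1] * acc[2] % p      # G_b * G_c
    return r_a, r_b, mults + 2
```

With this counting, the total is w(a) + w(b) - w(c) + 2 multiplications, so each shared one-bit saves one multiplication at the price of the two final recombinations.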



The computational effort required to generate the set 
$G = \{g^{k_i} \bmod p \mid i = 1, \ldots, N\}$

is about $\frac{N}{2}\left(\frac{3n}{4} - 3\right) + n$ but requires 3N/2 size(p)-bit
registers, which may not be practical in some situations. In the
following we present an optimization of our batch strategy
and compare it with the algorithm of Brickell, Gordon, McCurley and
Wilson [1], showing when our strategy is more
convenient and suitable.



3. Improvement and Performance 

3.1 BGCW algorithm 

The precomputation technique proposed by Brickell et al. in [1]
is based on the following observation : in [11] it was proposed to
precompute the set of $g^{2^i}$ to reduce the computational effort
at the price of increased storage. There is no reason to consider
only powers of 2.

The idea is to find a decomposition

$k = \sum_{i=0}^{m-1} a_i x_i$

where $0 \leq a_i \leq h$ for $0 \leq i < m$, then we can compute
$g^k = \prod_{d=1}^{h} c_d^{\,d}$ where $c_d = \prod_{a_i = d} g^{x_i}$.

The algorithm directly derived from this achieves the computation
of $g^k$ with only m+h-2 multiplications, but requires that the
values of $g^{x_i} \bmod p$ have been previously stored. To give an
overview of the performance of this algorithm, one can remark
that to generate a $g^k$ with a 160-bit k, say for a Schnorr or DSS
signature, using a base 16 representation for the numbers, the
BGCW algorithm produces $g^k$ at the cost of only 50
multiplications but requires also the storage of 40 /z-bit numbers. 
Considering a base 32 notation, one can save some storage (about
2 Kbytes rather than the previous 2.5) but the number of
multiplications grows to 60. An improvement based on the notion
of a basic digit set drastically improves speed,
since the number of multiplications falls to about 36, but implies
the storage of more than 200 numbers. The main drawback of the
method is clearly the minimum storage capacity needed to achieve
an exponentiation.
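
The table-based evaluation described in this section can be sketched for the simple case x_i = b^i, where the a_i are the base-b digits of k and h = b - 1. The function names are hypothetical and the sketch follows the textbook form of the technique in [1], not any particular implementation.

```python
def precompute(g, b, m, p):
    """Stored table g^(b^i) mod p for i = 0..m-1, i.e. the values g^{x_i}."""
    table, cur = [], g % p
    for _ in range(m):
        table.append(cur)
        cur = pow(cur, b, p)
    return table

def table_pow(table, k, b, p):
    """g^k mod p from the table, with a_i the base-b digits of k (0 <= a_i <= b-1)."""
    digits = []
    while k:
        k, a = divmod(k, b)
        digits.append(a)
    assert len(digits) <= len(table)
    result = running = 1
    for d in range(b - 1, 0, -1):                   # d = h, h-1, ..., 1
        for i, a in enumerate(digits):
            if a == d:
                running = running * table[i] % p    # product of all g^{x_i} with a_i >= d
        result = result * running % p               # one factor per level d
    return result
```

Each stored value g^{x_i} enters `running` once and is then multiplied into `result` exactly a_i times, which is how the product of the c_d^d is obtained without any exponentiation.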

3.2 Optimizing the exponent combination 

The total number of computations required to produce the set of
$g^k$s will depend mainly on the multiplications to be calculated
since the squarings are done once for all.

Considering that we want to group n-bit exponents from a large
set of N values together rather than only joining them by pairs, the
main issue is to find the best combination strategy to reduce the
multiplications to be done.

The number of multiplications per $g^k$ is on average



and the squaring effort can be divided between all the ks. 

The analysis of the function representation (see Annex 1) for
usual exponent lengths (160-bit and 512-bit) gives integer
solutions which minimize this quantity. Since the number of registers
required to achieve the computation increases with the number of
elements in the set (to compute the squarings once we must
calculate all the $g^k$s together), the best values appear to be 4 for
160-bit values and 5 for 512-bit values. The final choice for a
dedicated computation relies mainly on the memory capacity of
the machine which generates the $g^k$s.

3.3 Performance Overview

The various strategies achieve several time and memory
trade-offs. Tables 2 and 3 give the different
trade-offs for 160-bit (Schnorr, DSS) and 512-bit (El-Gamal,
Brickell-McCurley) exponents, considering that p is a 512-bit
prime modulus.



Set Size   Subset Size   Storage (Bytes)   Multiplications per computation
2          2             316               141
3          3             652               103
4          2             568               101
4          4             1,324             84.5
12         3             2,416             63
12         4             3,844             57.83
36         3             7,120             54.11
36         4             11,404            48.94
108        4             34,084            45.98

Table 2 : performances with 160-bit exponent


Set Size   Subset Size   Storage (Bytes)   Multiplications per computation
2          2             448               449
3          3             960               323
4          4             1,984             255
5          5             4,032             216.6
60         3             17,984            160.87
60         4             28,864            135.53
60         5             47,680            122.73
200        5             158,784           116.76

Table 3 : performances with 512-bit exponent



Compared to the Square-&-Multiply and BGCW algorithms, the
Batch Exponentiation strategy provides nice improvements in
both computation and memory. The pairs grouping provides a 42%
gain on the total computation at the cost of only 2 n-bit
registers, while BGCW implies at least 2 Kbytes of storage. The
technique is perfectly suited to low-memory environments where
memory cost is high; furthermore, the exponent re-combination is
easy to perform on any machine. On the other hand, the BGCW
algorithm is very efficient when at least a few Kbytes of
permanent memory are available.



An adaptation of Batch Exponentiation tailored for high-speed
transmission (see Annex 2) provides even greater improvements
by sub-contracting the squaring effort to an external device,
assuming nothing about its security.



4. Conclusion, Extensions and Open Questions 

We presented a strategy which can accelerate the generation of 
DLP-based signatures. The main characteristics of the batch- 
exponentiation technique presented in this article are summarized 
in the following table. 



scheme / efforts   Batch (memory)   Batch (time)
Schnorr            141 N            45.98 N
El-Gamal           449 N            116.76 N

Table 4 : exponentiation performance



Several open questions appear interesting to explore to further 
improve the proposed strategies : 

• For a power $k \in \mathbb{Z}_q$, try to find $\alpha$ such that $k' = \alpha q + k$ where
the Hamming weight of k' is significantly small. Since
computations are done modulo q, this transformation of k
does not have any impact on the result itself but may well
reduce the computation workload.

• Find an algorithm such that the construction of the subsets is
optimal, that is, such that the ordering of the $k_i$s results in as few
computations as possible.
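
The first open question can be explored with a naive search. The sketch below simply tries small multipliers and keeps the lightest candidate; the function name and the search bound are assumptions, and the sketch ignores the extra squarings that a longer k' would cost.

```python
def lighten(k, q, max_alpha=64):
    """Return k' = alpha*q + k (0 <= alpha <= max_alpha) of minimal Hamming weight."""
    candidates = (alpha * q + k for alpha in range(max_alpha + 1))
    return min(candidates, key=lambda v: bin(v).count("1"))
```

Since alpha = 0 is among the candidates, the result is never heavier than k itself, and g^{k'} = g^k whenever g has order q.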
Figure 2 : 512-bit exponent - Best integer solution = 5
Sub-contracting squarings with high-speed
Figure 1 : 160-bit exponent - Best integer solution = 4



Using a high-speed transmission interface such as that of a PCMCIA
card, one can subcontract the square computation and rely on the
device with the same level of security.



The strategy remains the same, grouping the exponents and
computing the $g^k$s, except that the squarings are computed by a
genuine device, assuming nothing about its tamper-resistance. The
device that shall compute the values has a certificate C on the set

$\{g^{2^i} \bmod p \mid i \leq \mathrm{size}(p)\}$, such as an iterative hashing of the

whole set.

Denoting Sender the computing device in charge of the squaring
effort and Receiver the exponentiation machine, the protocol is
the following :

Sender                      Receiver

s = g                       f = 0
For i = 0 to n-1
    Send s                  Receive s
                            (use if needed)
    s = s^2 mod p           f = SHA(f, s)

                            If C = f accept
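
The exchange can be simulated in a few lines. In the sketch below, SHA-1 from the standard library stands in for the iterated hash, the chain starts from an empty byte string rather than 0, and each value is serialized big-endian at the width of p; all three choices, like the function names, are assumptions made for illustration.

```python
import hashlib

def squares(g, n, p):
    """Sender side: yield g^(2^i) mod p for i = 0..n-1."""
    s = g % p
    for _ in range(n):
        yield s
        s = s * s % p

def chain_hash(values, p):
    """Iterated hash f <- H(f || s) over the stream; returns the final digest."""
    width = (p.bit_length() + 7) // 8
    f = b""
    for s in values:
        f = hashlib.sha1(f + s.to_bytes(width, "big")).digest()
    return f

def receive(stream, certificate, p):
    """Receiver side: consume the squares, accept only if the chain matches C."""
    received = list(stream)
    return received if chain_hash(received, p) == certificate else None
```

The Receiver needs no trust in the Sender: any altered square changes the final digest, so the comparison with C fails.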
Abstract. Smartcards are the most secure portable computing
device today. They have been used successfully in applications 
involving money, and proprietary and personal data (such as 
banking, healthcare, insurance, etc.). As smartcards get more 
powerful (with 32-bit CPU and more than 1 MB of stable 
memory in the next versions) and become multi-application, 
the need for database management arises. However, smart- 
cards have severe hardware limitations (very slow write, very 
little RAM, constrained stable memory, no autonomy, etc.) 
which make traditional database technology irrelevant. The 
major problem is scaling down database techniques so they 
perform well under these limitations. In this paper, we give an 
in-depth analysis of this problem and propose a PicoDBMS 
solution based on highly compact data structures, query ex- 
ecution without RAM, and specific techniques for atomicity 
and durability. We show the effectiveness of our techniques 
through performance evaluation. 

Key words: Smartcard applications - PicoDBMS - Storage 
model - Execution model - Query optimization - Atomicity 
- Durability 



1 Introduction 

Smartcards are the most secure portable computing device to- 
day. The first smartcard was developed by Bull for the French 
banking system in the 1980s to significantly reduce the losses 
associated with magnetic stripe credit card fraud. Since then, 
smartcards have been used successfully around the world in 
various applications involving money, proprietary data, and 
personal data (such as banking, pay-TV or GSM subscriber 
identification, loyalty, healthcare, insurance, etc.). While to- 
day's smartcards handle a single issuer-dependent application, 
the trend is toward multi-application smartcards 1 . Standards 
for multi-application support, like the JavaCard [36] and Mi- 
crosoft's SmartCard for Windows [26], ensure that the card 
be universally accepted and be able to interact with several 



service providers. This should make smartcards one of the 
world's highest-volume markets for semiconductors [14]. 

As smartcards become more and more versatile, multi- 
application, and powerful (32-bit processor, more than 1 MB 
of stable storage), the need for database techniques arises. Let 
us consider a health card storing a complete medical folder 
including the holder's doctors, blood type, allergies, prescrip- 
tions, etc. The volume of data can be important and the queries 
fairly complex (select, join, aggregate). Sophisticated access 
rights management using views and aggregate functions are re- 
quired to preserve the holder's data privacy. Transaction atom- 
icity and durability are also needed to enforce data consistency. 
More generally, database management helps to separate data 
management code from application code, thereby simplifying 
and making application code smaller. Finally, new applica- 
tions can be envisioned, like computing statistics on a large 
number of cards, in an asynchronous and distributed way. Sup- 
porting database management on the card itself rather than on 
an external device is the only way to achieve very high secu- 
rity, high availability (anywhere, anytime, on any terminal), 
and acceptable performance. 

However, smartcards have severe hardware limitations 
which stem from the obvious constraints of small size (to 
fit on a flexible plastic card and to increase hardware se- 
curity) and low cost (to be sold in large volumes). Today's 
microcontrollers contain a CPU, memory - including about 
96 kB of ROM, 4 kB of RAM, and up to 128 kB of stable stor- 
age like EEPROM - and security modules [39]. EEPROM 
is used to store persistent information; it has very fast read
time (60-100 ns) comparable to old-fashioned RAM but very
slow write time (more than 1 ms/word). Following Moore's
law for processor and memory capacities, smartcards will get 
rapidly more powerful. Existing prototypes, like Gemplus's 
Pinocchio card [16], bypass the current memory bottleneck 
by connecting an additional chip of 2MB of Flash memory to 
the microcontroller. Although a significant improvement over 
today's cards, this is still very restricted compared to other 
portable, less secure, devices such as Personal Digital Assis- 
tants (PDA). Furthermore, smartcards are not autonomous, 
i.e., have no independent power supply, thereby precluding 
asynchronous and disconnected processing. 



Everyone would probably enjoy carrying far fewer cards. 
These limitations (tiny RAM, little stable storage, very
costly write, and lack of autonomy) make traditional database
techniques irrelevant. Typically, traditional DBMS exploit
significant amounts of RAM and use caching and asynchronous
I/Os to reduce disk access overhead as much as possible. With 
the extreme constraints of the smartcard, the major problem is 
scaling down database techniques. While there has been much 
excellent work on scaling up to deal with very large databases, 
e.g., using parallelism, scaling down has not received much at- 
tention by the database research community. However, scaling 
down in general is becoming very important for commodity 
computing and is quite difficult [18]. 

Some DBMS designs have addressed the problem of scal- 
ing down. Light versions of popular DBMS like Sybase Adap- 
tive Server Anywhere [37], Oracle 8i Lite [30] or DB2 Every- 
where [20] have been primarily designed for portable comput- 
ers and PDA. They have a small footprint which they obtain 
by simplifying and componentizing the DBMS code. How- 
ever, they use relatively high RAM and stable memory and do 
not address the more severe limitations of smartcards. ISOL's 
SQLJava Machine DBMS [13] is the first attempt towards a 
smartcard DBMS while SCQL [24], the standard for smartcard 
database language, emerges. While both designs are limited to 
single select, they exemplify the strong interest for dedicated 
smartcard DBMS. 

In this paper, we address the problem of scaling down 
database techniques and propose the design of what we call a 
PicoDBMS. This work is done in the context of a new project 
with Bull Smart Cards and Terminals. The design has been 
made with smartcard applications in mind, but its scope ex- 
tends as well to any ultra-light computer device based on a 
secured monolithic chip. This paper makes the following con- 
tributions: 

• We analyze the requirements for a PicoDBMS based on a 
typical healthcare application and justify its minimal func- 
tionality. 

• We give an in-depth analysis of the problem by considering 
the smartcard hardware trends and derive design principles 
for a PicoDBMS. 

• We propose a new pointer-based storage model that inte- 
grates data and indices in a unique compact data structure. 

• We propose query execution techniques which handle 
complex query plans (including joins and aggregates) with 
no RAM consumption. 

• We propose transaction techniques for atomicity and dura- 
bility that reduce the logging cost to its lowest bound and 
enable a smartcard to participate in distributed transac- 
tions. 

• We show the effectiveness of each technique through per- 
formance evaluation. 

This paper is an extended version of [7]. In particular, the 
section on transaction management is new. The paper is orga- 
nized as follows. Section 2 illustrates the use of take-away 
databases in various classes of smartcard applications and 
presents in more detail the requirements of the health card 
application. Section 3 analyzes the smartcard hardware con- 
straints and gives the problem definition. Sections 4-6 present 
and assess the PicoDBMS' storage model, query execution 
model, and transaction model, respectively. Section 7 con- 
cludes. 



2 Smartcard applications 

In this section, we discuss the major classes of emerging smart- 
card applications and their database requirements. Then, we 
illustrate these requirements in further detail with the health 
card application, which we will use as reference example in 
the rest of the paper. 

2.1 Database management requirements 

Table 1 summarizes the database management requirements 
of the following typical classes of smartcard applications: 

• Money and identification: examples of such applications 
are credit cards, e-purse, SIM for GSM, phone cards, trans- 
portation cards. They are representative of today's applica- 
tions, with very few data (typically the holder's identifier 
and some status information). Querying is not a concern 
and access rights are irrelevant since cards are protected by 
PIN-codes. Their unique database management require- 
ment is update atomicity. 

• Downloadable databases: these are predefined packages 
of confidential data (e.g., diplomatic, military or business 
information) that can be downloaded on the card - for ex- 
ample, before traveling - and be accessed from any termi- 
nal. Data availability and security are the major concerns 
here. The volume of data can be important and the queries 
complex. The data are typically read-only. 

• User environment: the objective is to store in a smartcard 
an extended profile of the card's holder including, among 
others, data regarding the computing environment (PC's 
configuration, passwords, cookies, bookmarks, software 
licenses, etc.), an address book as well as an agenda. The 
user environment can thus be dynamically recovered from 
the profile on any terminal. Queries remain simple, as data 
are not related. However, some of the data are highly pri- 
vate and must be protected by sophisticated access rights 
(e.g., the card's holder may want to share a subset of her/his 
address book or bookmark list with a subset of persons). 
Transaction atomicity and durability are also required. 

• Personal folders: personal folders may be of a different na- 
ture: scholastic, healthcare, car maintenance history, loy- 
alty. They roughly share the same requirements, which 
we illustrate next with the healthcare example. Note that
queries involving data issued from different folders can 
make sense. For instance, one may be interested in discov- 
ering associations between some disease and the scholastic 
level of the card holder. This raises the interesting issue of 
maintaining statistics on a population of cards or mining 
their content asynchronously. 



2.2 The health card application 

The health card is very representative of personal folder appli- 
cations and has strong database requirements. Several coun- 
tries (France, Germany, USA, Russia, Korea, etc.) are devel- 
oping healthcare applications on smartcards [11]. The initial 
idea was to give to each citizen a smartcard containing her/his 
identification and insurance data. As smartcard storage ca- 
pacity increases, the information stored in the card can be 
extended to the holder's doctors, emergency data (blood type,
allergies, vaccination, etc.), surgical operations, prescriptions, 
insurance data and even links to heavier data (e.g., X-ray examination,
scanner images, etc.) stored on hospital servers. Different
users may query, modify, and create data in the holder's
folder: the doctors who consult the patient's past records and 
prescribe drugs, the surgeons who perform exams and opera- 
tions, the pharmacists who deliver drugs, the insurance agents 
who refund the patient, public organizations which maintain 
statistics or study the impact of drugs correlation in population 
samples, and finally the holder her/himself. 

We can easily observe that: (i) the amount of data is sig- 
nificant (more in terms of cardinality than in terms of volume 
because most data can be encoded); (ii) queries can be rather 
complex (e.g., a doctor asks for the last antibiotics prescribed 
to the patient); (iii) sophisticated access rights management
using views and aggregate functions is required (e.g.,
a statistical organization may access aggregate values only but 
not the raw data); (iv) atomicity must be preserved (e.g., when 
the pharmacist delivers drugs); and (v) durability is manda- 
tory, without compromising data privacy (logged data stored 
outside the card must be protected). 

One may wonder whether the holder's health data ought 
to be stored in a smartcard or in a centralized database. The 
benefit of distributing the healthcare database on smartcards 
is threefold. First, health data must be made highly available 
(anywhere, anytime, on any terminal, and without requiring 
a network connection). Second, storing sensitive data on a 
centralized server may damage privacy. Third, maintaining a 
centralized database is fairly complex due to the variety of 
data sources. Assuming the health data is stored in the smart- 
card, the next question is why the aforementioned database 
capabilities need to be hosted in the smartcard rather than the 
terminals. The answer is again availability (the data must be 
exploited on any terminal) and privacy. Regarding privacy, 
since the data must be confined in the chip, so must the query 
engine and the view manager. As the smartcard is the unique 
trusted part of the system, access rights and transaction man- 
agement cannot be delegated to an untrusted terminal. 



3 Problem formulation 

In this section, we make clear the smartcard constraints in 
order to derive design rules for the PicoDBMS and state the 
problem. Our analysis is based on the characteristics of both
existing smartcard products and current prototypes [16, 39],
and thus, should be valid for a while. We also discuss how the 
main constraints of the smartcard will evolve in the near future.



3.1 Smartcard constraints

Current smartcards include, in a monolithic chip, a 32-bit RISC
processor at about 30 MIPS, memory modules (about 96 kB
of ROM, 4 kB of static RAM, and 128 kB of EEPROM), and security
components (to prevent tampering); they take their electrical
energy from the terminal [39]. ROM is used to store the operating
system, the JavaCard virtual machine, fixed data, and
standard routines. RAM is used as working memory for maintaining
an execution stack and calculating results. EEPROM
is used to store persistent information. EEPROM has a very fast
read time (60-100 ns/word), comparable to old-fashioned RAM,
but a dramatically slow write time (more than 1 ms/word).
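
To make the read/write asymmetry concrete, the following back-of-the-envelope sketch compares the time needed to read and to write the same number of words. It uses round figures taken from the paragraph above (100 ns per read, 1 ms per write); the constant and function names are illustrative assumptions, not part of the original design.

```python
# Round figures taken from the text; all names here are hypothetical.
READ_NS_PER_WORD = 100          # upper end of the 60-100 ns/word read time
WRITE_NS_PER_WORD = 1_000_000   # 1 ms/word, the lower bound quoted for writes


def access_time_ms(words: int, ns_per_word: int) -> float:
    """Time in milliseconds to access `words` words one after the other."""
    return words * ns_per_word / 1_000_000


words = 1_000
read_ms = access_time_ms(words, READ_NS_PER_WORD)
write_ms = access_time_ms(words, WRITE_NS_PER_WORD)
ratio = WRITE_NS_PER_WORD // READ_NS_PER_WORD

print(f"read  {words} words: {read_ms} ms")
print(f"write {words} words: {write_ms} ms")
print(f"one write costs about {ratio} reads")
```

Under these assumed figures a single write is worth roughly ten thousand reads, which is why the design rules stated later trade extra reads for fewer writes.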

The main constraints of current smartcards are therefore: 
(i) the very limited storage capacity; (ii) the very slow write 
time in EEPROM; (iii) the extremely reduced size of the RAM; 
(iv) the lack of autonomy; and (v) a high security level that 
must be preserved in all situations. These constraints strongly 
distinguish smartcards from any other computing devices, including
lightweight computers such as PDAs.

Let us now consider how hardware advances can affect
these constraints, in particular memory size. Current smartcards
rely on a well-established and slightly out-of-date hardware
technology (0.35 µm) in order to minimize the production
cost (less than five dollars) and increase security [34]. Further- 
more, up to now, there was no real need for large memories 
in smartcard applications such as the holder's identification. 
According to major smartcard providers, the market pressure 
generated by emerging large storage demanding applications 
will lead to a rapid increase of the smartcard storage capac- 
ity. This evolution is however constrained by the smartcard 
tiny die size fixed to 25 mm² in the ISO standard [23], which
pushes for more integration. This limited size is due to security 
considerations (to minimize the risk of physical attack [5]) and 
practical constraints (e.g., the chip should not break when the 
smartcard is flexed). Another solution to relax the storage limit 
is to extend the smartcard storage capacity with external mem- 
ory modules. This is being done by Gemplus which recently 
announced Pinocchio [16], a smartcard equipped with 2MB 
of Flash memory linked to the microcontroller by a bus. Since 
hardware security can no longer be provided on this memory, 
its content must be either non-sensitive or encrypted. 

Another important issue is the performance of stable mem- 
ory. Possible alternatives to the EEPROM are Flash memory 
and Ferroelectric RAM (FeRAM) [15] (see Table 2 for perfor- 
mance comparisons). Flash is more compact than EEPROM 
and represents a good candidate for high capacity smartcards 
[16]. However, Flash banks need to be erased before writing, 
which is extremely slow. This makes Flash memory appro- 
priate for applications with a high read/write ratio (e.g., ad- 
dress books). FeRAM is undoubtedly an interesting option for 
smartcards as read and write times are both fast. Although its 
theoretical foundation was set in the early 1950s, FeRAM is 
just emerging as an industrial solution. Therefore, FeRAM is 
expensive, less secure than EEPROM or Flash, and its integration
with traditional technologies (such as CPUs) remains an
issue. Thus FeRAM could be considered a serious alternative
only in the very long term [15].

Table 2. Performance of stable memories for the smartcard

Memory type            EEPROM              FLASH               FeRAM
Read time (/word)      60 to 150 ns        70 to 200 ns        150 to 200 ns
Write time (/word)     1 to 5 ms           5 to 10 µs          150 to 200 ns
Erase time (/bank)     None                500 to 800 ms       None
Lifetime (*) (/cell)   10^5 write cycles   10^5 erase cycles   10^10 to 10^12 write cycles

(*) A memory cell can be overwritten a finite number of times.

Given these considerations, we assume in this paper 
a smartcard with a reasonable stable storage area (a few 
megabytes of EEPROM 2 ) and a small RAM area (some kilo- 
bytes). Indeed, there is no clear interest in having a large RAM 
area, given that the smartcard is not autonomous, thus pre- 
cluding asynchronous write operations. Moreover, more RAM 
means less EEPROM as the chip size is limited. 

3.2 Impact on the PicoDBMS architecture

We now analyze the impact of the smartcard constraints on 
the PicoDBMS architecture, thus justifying why traditional 
database techniques, and even lightweight DBMS techniques, 
are irrelevant. The smartcard's properties and their impact are: 

• Highly secure: smartcard's hardware security makes it the 
ideal storage support for private data. The PicoDBMS must 
contribute to the data security by providing access right 
management and a view mechanism that allows complex 
view definitions (i.e., supporting data composition and ag- 
gregation). The PicoDBMS code must not present security 
holes due to the use of sophisticated algorithms 3 . 

• Highly portable: the smartcard is undoubtedly the most 
portable personal computer (the wallet computer). The 
data located on the smartcard are thus highly available. 
They are also highly vulnerable since the smartcard can 
be lost, stolen or accidentally destroyed. The main conse- 
quence is that durability cannot be enforced locally. 

• Limited storage resources: despite the foreseen increase 
in storage capacity, the smartcard will remain the lightest 
representative of personal computers for a long time. This 
means that specific storage models and execution tech- 
niques must be devised to minimize the volume of per- 
sistent data (i.e., the database) and the memory consump- 
tion during execution. In addition, the functionalities of 
the PicoDBMS must be carefully selected and their implementation
must be as light as possible. The lighter the
PicoDBMS, the bigger the onboard database.

• Stable storage is main memory: smartcard stable memory
provides the read speed and direct access granularity of a 
main memory. Thus, a PicoDBMS can be considered as a 
main memory DBMS (MMDBMS). However the dramatic 
cost of writes distinguishes a PicoDBMS from a traditional
MMDBMS. This impacts the storage and access
methods of the PicoDBMS as well as the way transaction
atomicity is achieved.

• Non-autonomous: compared to other computers, the
smartcard has no independent power supply, thereby precluding
disconnected and asynchronous processing. Thus,
all transactions must be completed while the card is inserted
in a terminal (unlike on a PDA, write operations cannot
be cached in RAM and written to stable storage asynchronously).

2 Considering Flash instead of EEPROM will not change our conclusions.
It will just exacerbate them.

3 Most security holes are the results of software bugs [34].

3.3 Problem statement

To summarize, our goal is to design a PicoDBMS including 
the following components: 

• Storage manager: manages the storage of the database and
the associated indices.

• Query manager: processes query plans composed of select,
project, join, and aggregate operators.

• Transaction manager: enforces the ACID properties and
participates in distributed transactions.

• Access right manager: provides access rights on base data 
and on complex user-defined views. 

Thus, the PicoDBMS hosted in the chip provides the min- 
imal subset of functionality that is strictly needed to manage 
in a secure way the data shared by all onboard applications. 
Other components (e.g., the GUI, a sort operator, etc.) can be 
hosted in the terminal or be dynamically downloaded when 
needed, without threatening security. In the rest of this pa- 
per, we concentrate on the components which require non- 
traditional techniques (storage manager, query manager, and 
transaction manager) and ignore the access right manager for 
which traditional techniques can be used. 

When designing the PicoDBMS's components, we must
follow several design rules derived from the smartcard's prop- 
erties: 

• Compactness rule: minimize the size of data structures 
and the PicoDBMS code to cope with the limited stable 
memory area (a few megabytes). 

• RAM rule: minimize the RAM usage given its extremely 
limited size (some kilobytes). 

• Write rule: minimize write operations given their dramatic
cost (≈ 1 ms/word).

• Read rule: take advantage of the fast read operations (≈
100 ns/word).

• Access rule: take advantage of the low granularity and 
direct access capability of the stable memory for both read 
and write operations. 

• Security rule: never externalize private data from the chip 
and minimize the algorithms' complexity to avoid security 
holes. 



4 PicoDBMS storage model 

In this section, following the design rules for a PicoDBMS, we 
discuss the storage issues and propose a very compact model 
based on a combination of flat storage, domain storage, and 
ring storage. We also evaluate the storage cost of our storage 
model. 
4.1 Flat storage

The simplest way to organize data is flat storage (FS), where 
tuples are stored sequentially and attribute values are embed- 
ded in the tuples. Although it does not impose it, the SCQL 
standard [24] considers FS as the reference storage model for 
smartcards. The main advantage of FS is access locality. How- 
ever, in our context, FS has two main drawbacks: 

• Space consuming: while normalization rules preclude
redundancy among combinations of attributes, they do not
avoid attribute value duplicates (e.g., the attribute DoctorSpecialty
may contain many duplicates).

• Inefficient: in the absence of index structures, all operations
are computed sequentially. While this is convenient
for old-fashioned cards (some kilobytes of storage and a
mono-relation select operator), this is no longer acceptable
for future cards where storage capacity is likely to
exceed 1 MB and queries can be rather complex.

Adding index structures to FS may solve the second prob- 
lem while worsening the first one. Thus, FS alone is not ap- 
propriate for a PicoDBMS. 



4.2 Domain storage 

Based on the critique of FS, it follows that a PicoDBMS stor- 
age model should guarantee both data and index compactness. 
Let us first deal with data compactness. Since locality is no 
longer an issue in our context, pointer-based storage models 
inspired by MMDBMS [3, 27, 31] can help reduce the data
storage cost. The basic idea is to preclude any duplicate value
from occurring. This can be achieved by grouping values in
domains (sets of unique values). We call this model domain 
storage (DS). As shown in Fig. 1, tuples reference their at- 
tribute values by means of pointers. Furthermore, a domain 
can be shared among several attributes. This is particularly 
efficient for enumerated types, which vary on a small and de- 
termined set of values 4 . 

One may wonder about the cost of tuple creation, update, 
and deletion since they may generate insertion and deletion 
of values in domains. While these actions are more complex 
than their FS counterpart, their implementation remains more 
efficient in the smartcard context, simply because the amount 
of data to be written is much smaller. To amortize the slight
overhead of domain storage, we store by domain only large
attributes (i.e., greater than a pointer size) containing duplicates.
Obviously, attributes with no duplicates (e.g., keys) need
not be stored by domain but with FS. Variable-size attributes
- generally larger than a pointer - can also be advantageously
stored in domains even if they do not contain duplicates. The
benefit is not storage savings but memory management simplicity
(all tuples of all relations become fixed-size) and log
compactness (see Sect. 6).

4 Compression techniques can be advantageously used in conjunction
with DS to increase compactness [17].

Fig. 2. Ring storage
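
The domain storage scheme of this section can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: list indices stand in for pointers, the `Domain` class and its `intern` method are hypothetical names, and the lookup dictionary is a convenience of the sketch that a real card would not keep in RAM.

```python
class Domain:
    """A set of unique values; tuples refer to a value by its position."""

    def __init__(self):
        self.values = []      # unique values, in insertion order
        self._position = {}   # value -> index (sketch-only lookup helper)

    def intern(self, value):
        """Return the 'pointer' (a list index) for value, storing it only once."""
        if value not in self._position:
            self._position[value] = len(self.values)
            self.values.append(value)
        return self._position[value]


specialty = Domain()

# Each Doctor tuple holds its key stored flat (FS) and a pointer into the
# DoctorSpecialty domain (DS).
doctors = [
    (1, specialty.intern("cardiology")),
    (2, specialty.intern("dermatology")),
    (3, specialty.intern("cardiology")),
]

# Projection follows the pointer back to the shared value.
names = [specialty.values[ptr] for _, ptr in doctors]
```

Here the string "cardiology" is stored once although two tuples use it, and every tuple has the same fixed size regardless of the length of the value.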



4.3 Ring storage 

We now address index compactness along with data compact- 
ness. Unlike disk-based DBMS that favor indices which pre- 
serve access locality, smartcards should make intensive use 
of secondary (i.e., pointer-based) indices. The issue here is to 
make these indices as compact as possible. Let us first consider 
select indices. A select index is typically made of two parts: a 
collection of values and a collection of pointers linking each 
value to all tuples sharing it. Assuming the indexed attribute 
varies on a domain, the index's collection of values can be 
saved since it exactly corresponds to the domain extension. 
The extra cost incurred by the index is then reduced to the 
pointers linking index values to tuples. 

Let us go one step further and get these pointers almost for 
free. The idea is to store these value-to-tuple pointers in place 
of the tuple-to-value pointers within the tuples (i.e., pointers 
stored in the tuples to reference their attribute values in the 
domains). This yields an index structure which makes a ring
from the domain values to the tuples. Hence, we call it ring 
index (see Fig. 2a). However, the ring index can also be used to 
access the domain values from the tuples and thus serve as data 
storage model. Thus we call ring storage (RS) the storage of 
a domain-based attribute indexed by a ring. The index storage 
cost is reduced to its lowest bound, that is, one pointer per 
domain value, whatever the cardinality of the indexed relation. 
This important storage saving is obtained at the price of extra 
work for projecting a tuple to the corresponding attribute since 
retrieving the value of a ring stored attribute means traversing 
on average half of the ring (i.e., up to reaching the domain
value). 
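
A ring can be sketched as follows. This is a hedged illustration rather than the authors' implementation: pointers are modelled as tagged pairs, `("D", i)` for domain entry i and `("T", j)` for tuple j, and all function names are hypothetical.

```python
domain_values = ["antibiotic", "analgesic"]
domain_head = [("D", 0), ("D", 1)]   # an empty ring points back to its own value
tuple_next = []                      # per-tuple ring pointer (replaces the value pointer)


def insert_tuple(domain_idx):
    """Link a new tuple at the head of the ring of domain value domain_idx."""
    tuple_next.append(domain_head[domain_idx])
    domain_head[domain_idx] = ("T", len(tuple_next) - 1)
    return len(tuple_next) - 1


def select(domain_idx):
    """Index use: follow the ring from the value through every tuple sharing it."""
    result, ptr = [], domain_head[domain_idx]
    while ptr[0] == "T":
        result.append(ptr[1])
        ptr = tuple_next[ptr[1]]
    return result


def project(tuple_idx):
    """Storage use: walk the rest of the ring until the domain value is reached."""
    ptr = tuple_next[tuple_idx]
    while ptr[0] == "T":
        ptr = tuple_next[ptr[1]]
    return domain_values[ptr[1]]


for d in (0, 1, 0, 0):
    insert_tuple(d)
```

In this sketch `select(0)` visits only the tuples on the ring of the first value, whereas `project` may have to traverse several tuples before reaching the value, which mirrors the trade-off described above. The only space added over plain domain storage is `domain_head`, i.e., one pointer per domain value.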

Join indices [40] can be treated in a similar way. A join
predicate of the form (R.a = S.b) assumes that R.a and S.b
vary on the same domain. Storing both R.a and S.b by means
of rings leads to defining a join index. In this way, each domain
value is linked by two separate rings to all tuples from R and
S sharing the same join attribute value. However, most joins
are performed on key attributes, R.a being a primary key and
S.b being the foreign key referencing R.a. In our model, key
attributes are not stored by domain but with FS. Nevertheless,
since R.a is the primary key of R, its extension forms precisely
a domain, even if not stored outside of R. Since attributes S.b
take their values in R.a's domain, they reference R.a values
by means of pointers. Thus, the domain-based storage model
naturally implements for free a unidirectional join index from
S.b to R.a (i.e., each S tuple is linked by a pointer to each
R tuple matching with it). If traversals from R.a to S.b need
to be optimized too, a bi-directional join index is required.
This can be simply achieved by defining a ring index on S.b.
Figure 2b shows the resulting situation where each R tuple is
linked by a ring to all S tuples matching with it and vice versa.
The cost of a bi-directional join index is restricted to a single
pointer per R tuple, whatever the cardinality of S. Note that
this situation resembles the well-known Codasyl model.



4.4 Storage cost evaluation 

Our storage model combines FS, DS, and RS. Thus, the issue 
is to determine the best storage for each attribute. If the at- 
tributes need not be indexed, the choice is obviously between 
FS and DS. Otherwise, the choice is between RS and FS with a 
traditional index. Thus, we compare the storage cost for a sin- 
gle attribute, indexed or not, for each alternative. We introduce 
the following parameters: 

• CardRel: cardinality of the relation holding the attribute. 

• a: average length of the attribute (expressed in bytes). 

• p: pointer size (3 bytes will be required to address "large" 
memory of future cards). 

• S: selectivity factor of the attribute. S = CardDom/CardRel,
where CardDom is the cardinality of the
attribute domain extension (in all models). S measures the
redundancy of the attribute (i.e., the same attribute value
appears in 1/S tuples).



Cost(FS) = CardRel*a

Cost(DS) = CardRel*p + S*CardRel*a

Cost(Indexed_FS) = Cost(FS) + ...

Fig. 3. Storage models' tradeoff



The cost equality between FS and DS gives: S = (a-p) / a.
The cost equality between Indexed_FS and RS gives:

S = a/p

Figure 3a shows the different values of S and a for which
FS and DS are equivalent. Thus, each curve divides the plane
into a gain area for FS (above the curve) and a gain area for DS
(under the curve). For values of a less than 3 (i.e., the size of a
pointer), FS is obviously always more compact than DS. For
higher values of a, DS becomes rapidly more compact than FS
except for high values of S. For instance, considering S = 0.5,
that is, the same value is shared by only two tuples, DS outperforms
FS for all a larger than 6 bytes. The higher a and the
lower S, the better DS. The benefit of DS is thus particularly
important for enumerated type attributes. Figure 3b compares
Indexed_FS with RS. The superiority of RS is obvious, except
for 1- and 2-byte-long key attributes. Thus, Figs. 3a and 3b
are guidelines for the database designer to decide how to store
each attribute, by considering its size and selectivity.
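
The storage cost comparison can be checked numerically with a short sketch. The formulas for FS and DS follow the parameter definitions given in this section. The formulas for Indexed_FS and RS are an assumption of this sketch: they are reconstructed from the description of a select index (one copy of each distinct value plus one pointer per tuple) and of the ring (one extra pointer per domain value), and were chosen because they reproduce the stated break-even point S = a/p. Function names are hypothetical.

```python
def cost_fs(card_rel, a):
    """Flat storage: every tuple embeds the attribute value."""
    return card_rel * a


def cost_ds(card_rel, a, p, s):
    """Domain storage: one pointer per tuple plus one copy of each distinct value."""
    return card_rel * p + s * card_rel * a


def cost_indexed_fs(card_rel, a, p, s):
    # ASSUMPTION: index = collection of distinct values + one pointer per tuple.
    return cost_fs(card_rel, a) + s * card_rel * a + card_rel * p


def cost_rs(card_rel, a, p, s):
    # ASSUMPTION: ring storage = domain storage + one pointer per domain value.
    return cost_ds(card_rel, a, p, s) + s * card_rel * p


P = 3  # pointer size in bytes, as assumed in the text

# Break-even between FS and DS at S = (a - p) / a: with S = 0.5, a = 6 bytes.
print(cost_fs(1000, 6), cost_ds(1000, 6, P, 0.5))

# Above 6 bytes DS wins for S = 0.5.
print(cost_fs(1000, 10), cost_ds(1000, 10, P, 0.5))

# A 2-byte key (S = 1) is the case where Indexed_FS beats RS.
print(cost_indexed_fs(1000, 2, P, 1.0), cost_rs(1000, 2, P, 1.0))
```

Setting cost_fs equal to cost_ds gives a = p + S*a, hence S = (a-p)/a; setting the two assumed indexed costs equal gives a = S*p, hence S = a/p, in agreement with the equalities quoted in the text.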



5 Query processing 

Traditional query processing strives to exploit large main 
memory for storing temporary data structures (e.g., hash ta- 
bles) and intermediate results. When main memory is not large 
enough to hold some data, state-of-the-art algorithms (e.g., hy- 
brid hash join [33]) resort to materialization on disk to avoid 
memory overflow. These algorithms cannot be used for a Pi- 
coDBMS because: 

• Given the write rule and the lifetime of stable memory, 
writes in stable memory are proscribed, even for temporary 
materialization. 






P. Pucheral et al.: PicoDBMS: Scaling down database techniques for the smartcard 



• Dedicating a specific RAM area does not help since we 
cannot estimate its size a priori. Making it small increases 
the risk of memory overflow, thereby leading to writes in 
stable memory. Making it large reduces the stable memory 
area, already limited in a smartcard (RAM rule). More- 
over, even a large RAM area cannot guarantee that query 
execution will not produce memory overflow [9]. 

• State-of-the-art algorithms are quite sophisticated, which 
precludes their implementation in a PicoDBMS whose 
code must be simple, compact, and secure (compactness 
and security rules). 

To solve this problem, we propose query processing tech- 
niques that do not use any working RAM area nor incur any 
writes in stable memory. In the following, we describe these 
techniques for simple and complex queries, including aggregation
and duplicate removal. We show the effectiveness of
our solution through a performance analysis. 



5.1 Basic query execution without RAM

We consider the execution of SPJ (Select/Project/Join) 
queries. Query processing is classically done in two steps. The 
query optimizer first generates an "optimal" query execution 
plan (QEP). The QEP is then executed by the query engine 
which implements an execution model and uses a library of
relational operators [17]. The optimizer can consider differ- 
ent shapes of QEP: left-deep, right-deep or bushy trees (see 
Fig. 4). In a left-deep tree, operators are executed sequentially 
and each intermediate result is materialized. On the contrary, 
right-deep trees execute operators in a pipeline fashion, thus 
avoiding intermediate result materialization. However, they 
require materializing in memory all left relations. Bushy trees 
offer opportunities to deal with the size of intermediate results 
and memory consumption [38]. 

In a PicoDBMS, the query optimizer should not consider 
any of these execution trees as they incur materialization. The 
solution is to only use pipelining with extreme right-deep trees 
where all the operators (including select) are pipelined. As left 
operands are always base relations, they are already materi- 
alized in stable memory, thus allowing us to execute a plan 
with no RAM consumption. Pipeline execution can be easily 
achieved using the well-known Iterator Model [17]. In this 
model, each operator is an iterator that supports three proce- 
dure calls: open to prepare an operator for producing an item, 
next to produce an item, and close to perform final clean-up. 
A QEP is activated starting at the root of the operator tree and
progressing towards the leaves. The dataflow in the model is 
demand-driven: a child operator passes a tuple to its parent 
node in response to a next call from the parent. 
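
The open/next/close protocol and the shape of an extreme right-deep plan can be sketched as follows. This is a simplified illustration, not the PicoDBMS engine: Python lists stand in for base relations in stable memory, the class names and the toy Drug/Prescription schema are assumptions, and each operator keeps only a cursor or the current tuple as state.

```python
class Scan:
    """Leaf iterator over a base relation (a list standing in for stable memory)."""

    def __init__(self, relation):
        self.relation = relation
        self.pos = 0

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.relation):
            return None
        row = self.relation[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Select:
    """Passes on only the child tuples that satisfy the predicate."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        row = self.child.next()
        while row is not None and not self.predicate(row):
            row = self.child.next()
        return row

    def close(self):
        self.child.close()


class Join:
    """Nested-loop join of a base relation with a pipelined child operator."""

    def __init__(self, base, child, match):
        self.base, self.child, self.match = Scan(base), child, match
        self.outer = None

    def open(self):
        self.child.open()
        self.base.open()
        self.outer = self.child.next()

    def next(self):
        while self.outer is not None:
            row = self.base.next()
            while row is not None:
                if self.match(row, self.outer):
                    return row + self.outer
                row = self.base.next()
            self.outer = self.child.next()  # pull the next pipelined tuple
            self.base.open()                # and rescan the base relation
        return None

    def close(self):
        self.child.close()
        self.base.close()


def run(root):
    """Drive the plan from the root; the list stands for tuples sent to the terminal."""
    root.open()
    out, row = [], root.next()
    while row is not None:
        out.append(row)
        row = root.next()
    root.close()
    return out


drugs = [(1, "antibiotic"), (2, "analgesic")]   # (drug_id, type)
prescriptions = [(10, 1), (11, 2), (12, 1)]     # (prescription_id, drug_id)

plan = Join(prescriptions,
            Select(Scan(drugs), lambda d: d[1] == "antibiotic"),
            lambda p, d: p[1] == d[0])
result = run(plan)
```

Each call to `next` on the root propagates down to the leaf and returns at most one tuple, so no intermediate result is ever stored.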

Let us now detail how select, project, and join are per- 
formed. These operators can be executed either sequentially 
or with a ring index. Given the access rule, the use of indices 
seems always to be the right choice. However, extreme right- 
deep trees allow us to speed-up a single select on the first base 
relation (e.g., Drug.type in our example), but using a ring in- 
dex on the other selected attributes (e.g., Visit.date) may slow 
down execution as the rings need to be traversed to retrieve 
their value. Project operators are pushed up the tree since
no materialization occurs. Note that the final project incurs
an additional cost in case of ring attributes. Without indices,
joining relations is done by a nested-loop algorithm since no
other join technique can be applied without ad hoc structures
(e.g., hash tables) and/or working area (e.g., sorting). The cost
of indexed joins depends on the way indices are traversed.
Consider the indexed join between Doctor (n tuples) and Visit
(m tuples) on their key attribute. Assuming a unidirectional
index, the join cost is proportional to n * m starting with Doctor
and to m starting with Visit. Assuming now a bi-directional
index, the join cost becomes proportional to n + m starting
with Doctor and to m²/2n starting with Visit (retrieving the
doctor associated to each visit incurs traversing half of a ring
on average). In the latter case, a naive nested loop join can be
more efficient if the ring cardinality is greater than the target
relation cardinality (i.e., when m > n²). In that case, the
database designer must clearly choose a unidirectional index
between the two relations.

Fig. 4. Several execution trees for query Q1
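
The join cost figures quoted for Doctor (n tuples) and Visit (m tuples) can be tabulated with a small sketch. The function names are hypothetical and the returned values are relative costs (tuple accesses up to a constant factor), exactly as stated in the text.

```python
def indexed_join_cost(n, m, index, start):
    """Relative cost of the indexed Doctor-Visit join, per the text's formulas."""
    if index == "unidirectional":
        return n * m if start == "Doctor" else m
    if index == "bidirectional":
        return n + m if start == "Doctor" else m * m / (2 * n)
    raise ValueError(f"unknown index kind: {index}")


def nested_loop_cost(n, m):
    """Relative cost of a naive nested-loop join."""
    return n * m


# 50 doctors and 1,000 visits: rings are short, so the ring traversal wins.
print(indexed_join_cost(50, 1000, "bidirectional", "Visit"), nested_loop_cost(50, 1000))

# 10 doctors and 1,000 visits: rings are long, so the naive join is cheaper.
print(indexed_join_cost(10, 1000, "bidirectional", "Visit"), nested_loop_cost(10, 1000))
```

Note that comparing m²/2n with n * m literally gives a crossover at m = 2n²; the condition m > n² in the text compares the ring cardinality m/n with n and thus ignores the factor of one half.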



5.2 Complex query execution without RAM 

We now consider the execution of aggregate, sort, and du- 
plicate removal operators. At first glance, pipeline execution 
is not compatible with these operators which are classically 
performed on materialized intermediate results. Such materi- 
alization cannot occur either in the smartcard due to the RAM 
rule or in the terminal due to the security rule. Note that sorting
can be done in the terminal since the output order of the
result tuples is not significant, i.e., depends on the DBMS algorithms.

Fig. 5. Four 'complex' query execution plans

We propose a solution to the above problem by exploiting 
two properties: (i) aggregate and duplicate removal can be 
done in pipeline if the incoming tuples are already grouped by
distinct values; and (ii) pipeline operators are order-preserving 
since they consume (and produce) tuples in the arrival order. 
Thus, enforcing an adequate consumption order at the leaf of 
the execution tree allows pipelined aggregation and duplicate 
removal. For instance, the extreme right-deep tree of Fig. 4 
delivers the tuples naturally grouped by Drug.id, thus allowing 
group queries on that attribute. 
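
Property (i) can be illustrated with a minimal sketch of a pipelined count. It assumes that the input stream is already grouped by the aggregation key, as an order-preserving pipeline would deliver it; the function name and the sample rows are hypothetical.

```python
def pipelined_count(rows, key):
    """Yield (group, count) pairs from a stream already grouped by key.

    Only the current group value and its counter are held at any time,
    so no intermediate result is materialized.
    """
    current, count, started = None, 0, False
    for row in rows:
        k = key(row)
        if started and k != current:
            yield current, count   # the group is complete: emit it
            count = 0
        current, started = k, True
        count += 1
    if started:
        yield current, count


# Tuples arrive grouped by drug id, e.g. (drug_id, prescription_id).
stream = [(1, 10), (1, 12), (2, 11)]
counts = list(pipelined_count(stream, key=lambda r: r[0]))
```

If the stream were not grouped, the same group would be emitted several times, which is why the consumption order at the leaf of the tree matters.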

Let us now consider query Q2 of Fig. 5. As pictured, exe- 
cuting Q2 in pipeline requires rearranging the execution tree 
so that relation Doctor is explored first. Since Doctor contains 
distinct doctors, the tuples arriving at the count operator are
naturally grouped by doctors. 

The case of Q3 is harder. As the data must be grouped 
by type of drugs rather than by Drug.id, an additional join is 
required between relation Drug and domain Drug.type. Domain
values being unique, this join produces the tuples in the
adequate order. If domain Drug.type does not exist, an opera- 
tor must be introduced to sort relation Drug in pipeline. This 
can be done by performing n passes on Drug where n is the 
number of distinct values of Drug.type. 
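
A multi-pass grouping operator of this kind can be sketched as below. This is one possible realization and an assumption of the sketch, not the authors' algorithm: for each distinct value it performs one look-ahead scan to find the next value and one scan to emit the matching tuples, keeping only a constant amount of state.

```python
def grouped_scan(relation, key):
    """Yield the tuples of relation grouped by key using repeated scans."""
    current, first = None, True
    while True:
        # Look-ahead scan: smallest key value not emitted yet.
        nxt = None
        for row in relation:
            k = key(row)
            if (first or k > current) and (nxt is None or k < nxt):
                nxt = k
        if nxt is None:
            return
        # Emission scan: deliver every tuple carrying that value.
        for row in relation:
            if key(row) == nxt:
                yield row
        current, first = nxt, False


drug = [(1, "b"), (2, "a"), (3, "b"), (4, "c")]   # (drug_id, type)
ordered = list(grouped_scan(drug, key=lambda r: r[1]))
```

The number of scans grows with the number of distinct values, which is acceptable only because reads in stable memory are cheap and the relation is never rewritten.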



The case of Q4 is even trickier. The result must be grouped 
on two attributes (Doctor.id and Drug.type), introducing the 
need to start the tree with both relations! The solution is to 
insert a Cartesian product operator at the leaf of the tree in 
order to produce tuples ordered by Doctor.id and Drug.type. 
In this particular case, the query response time should be ap- 
proximately n times greater than the same query without the 
'group by' clause, where n is the number of distinct types of
drugs. 

Q5 retrieves the distinct couples of doctor and type of prescribed
drugs. This query can be made similar to Q4 by expressing
the distinct clause as an aggregate without function
(i.e., the query "select distinct a1, ..., an from ..." is equivalent
to "select a1, ..., an from ... group by a1, ..., an"). The
only difference is that the computation for a given group
(i.e., distinct result tuple) can stop as soon as one tuple has
been produced.
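
The early-stop behaviour of duplicate removal can be sketched as follows. The data, the `matches` helper, and the way candidate groups are enumerated (a Cartesian product of doctors and drug types, in the spirit of the plan for Q4) are illustrative assumptions.

```python
from itertools import product

doctors = ["d1", "d2"]
drug_types = ["antibiotic", "analgesic"]
# (doctor, drug type) of each prescription, duplicates included
prescribed = [("d1", "antibiotic"), ("d1", "antibiotic"), ("d2", "analgesic")]


def matches(group):
    """Lazily scan the prescriptions belonging to one (doctor, type) group."""
    return (row for row in prescribed if row == group)


def distinct_groups(candidates, matches):
    """Emit each candidate group once, stopping at its first matching tuple."""
    for group in candidates:
        if next(matches(group), None) is not None:
            yield group


result = list(distinct_groups(product(doctors, drug_types), matches))
```

Because `next` pulls a single tuple from the lazy scan, the remaining tuples of a group are never examined once the group is known to be non-empty.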



5.3 Query optimization 

Heuristic optimization is attractive. However, well-known 
heuristics such as processing select and project first do not 
work here. Using extreme right-deep trees makes the former 
impractical and invalidates the latter. Heuristics for join order- 
ing are even more risky considering our data structures. Con- 
versely, there are many arguments for an exhaustive search 
of the best plan. First, the search space is limited since: (i) 
there is a single algorithm for each operator, depending on the 
existing indices; (ii) only extreme right-deep trees are consid- 
ered; and (iii) typical queries will not involve many relations. 
Second, exhaustive search using depth-first algorithms does not
consume any RAM. Finally, exhaustive algorithms are simple 
and compact (even if they iterate a lot). Under the assump- 
tion that query optimization is required in a PicoDBMS, the 
remarks above strongly argue in favor of an exhaustive search 
strategy. 
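
An exhaustive depth-first search over join orders can be sketched in a few lines. The cost function below is a toy assumption (it simply sums the sizes of successive Cartesian prefixes) and is not the paper's cost model; the function names are hypothetical. The point is only that the search keeps the best plan found so far and a stack whose depth is bounded by the number of relations.

```python
def best_order(relations, cost, prefix=(), best=None):
    """Depth-first enumeration of all join orders, keeping only the best so far."""
    if len(prefix) == len(relations):
        c = cost(prefix)
        return (c, prefix) if best is None or c < best[0] else best
    for r in relations:
        if r not in prefix:
            best = best_order(relations, cost, prefix + (r,), best)
    return best


CARD = {"Doctor": 50, "Visit": 1000, "Drug": 200}


def toy_cost(order):
    """Hypothetical cost: sum of the sizes of the successive prefixes' products."""
    total, running = 0, 1
    for r in order:
        running *= CARD[r]
        total += running
    return total


cost, order = best_order(tuple(CARD), toy_cost)
```

With three relations only six orders are visited, which illustrates why the search space stays small for typical smartcard queries.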



5.4 Performance evaluation 

Our proposed query engine can handle fairly complex queries, 
taking advantage of the read and access rules 5 while satis- 
fying the compactness, write, RAM, and security rules. We 
now evaluate whether the PicoDBMS performance matches 
the smartcard application's requirements, that is, any query 
issued by the application can be performed in reasonable time 
(i.e., may not exceed the user's patience). Since the PicoDBMS 
code's simplicity is an important consideration to conform to 
the compactness and security rules, we must also evaluate 
which acceleration techniques (i.e., ring indices, query opti- 
mization) are really mandatory. For instance, an accelerator 
reducing the response time from 10 ms to 1 ms is useless in 
the smartcard context 6 . Thus, unlike traditional performance 
evaluation, our major concern is on absolute rather than rela- 
tive performance. 

5 With traditional DBMS, such techniques will induce so many 
disk accesses that the system would thrash! 

6 With traditional DBMS, such acceleration can improve the trans- 
actional throughput. 
Fig. 6. Performance results for Q1

Evaluating absolute response time is complex in the smart- 
card environment because all platform parameters (e.g., pro- 
cessor speed, caching strategy, RAM, and EEPROM speed) 
strongly impact on the measurements 7 . Measuring the per- 
formance of our PicoDBMS on Bull's smartcard technology 
is attractive but introduces two problems. First, Bull's smart- 
cards compatible with database applications are still proto- 
types [39]. Second, we are interested in providing the most 
general conclusions (i.e., as independent as possible of smart- 
card architectures). Therefore, we prefer to measure our query 
engine on two old-fashioned computers (a PC 486/25 MHz and
a Sun SparcStation 1+) which we felt roughly similar to forth- 
coming smartcard architectures. For each computer, we vary 
the system parameters (clock frequency, cache) and perform 
the experimentation tests. The performance ratios between 
all configurations were roughly constant (i.e., whatever the 
query), the slowest configuration (Intel 486 with no cache) per- 
forming eight times worse than the fastest (RISC with cache). 
In the following, we present response times for the slowest 
architecture to check the viability of our solutions in the worst 
environment. 

We generated three instances of a simplified healthcare 
database: the small, medium, and large databases containing, 
respectively, (10, 30, 50) doctors, (100, 500, 1,000) visits, 
(300, 2,000, 5,000) prescriptions, and (40, 120, 200) drugs.
Although we tested several queries, we describe below only the
two most significant. Query Q1, which contains three joins and
two selects on Visit and Drug (with selectivities of 20% and 
5%), is representative of medium-complexity queries. Query 
Q4, which performs an aggregate on two attributes and re- 
quires the introduction of a Cartesian product, is representative 
of complex queries. For each query, we measure the performance
for all possible query execution plans, excluding those
which induce an additional Cartesian product, and vary the
storage choices (with and without select and join ring indices).
Figures 6 and 7 show the results for both best and worst plans 
on databases built with or without join indices. 

Considering SPJ queries, the PicoDBMS performance 
clearly matches the application's requirements as soon as join 
rings are used. Indeed, the performance with join rings is at 

7 With traditional DBMS, very slow disk access allows us to ignore 
finer parameters. 
Fig. 7. Performance results for Q4

most 146 ms for the largest database and with the worst execution
plan. With small databases, all the acceleration techniques
can be discarded, while with larger ones, join rings remain nec- 
essary to obtain good response time. In that case, the absolute 
gain (110 ms) between the best and the worst plan does not
justify the use of a query optimizer. 

The performance of aggregate queries is clearly the worst 
because they introduce a Cartesian product at the leaf of the 
execution tree. Join rings are useful for medium and large 
databases. With large databases, the optimizer turns out to 
be necessary since the worst execution plan with join rings 
achieves a rather long response time (20.6 s). 

The influence of ring indices for selects (not shown) is in- 
significant. Depending on the selectivity, it can bring slight im- 
provement or overhead on the results. Although it may achieve 
an important relative speed-up for the select itself, the abso- 
lute gain is not significant considering the small influence of 
select on the global query execution cost (which is not the case 
in disk-based DBMS). Select ring indices are, however, use- 
ful for queries with aggregates or duplicate removal, that can 
result in a join between a relation and the domain attribute. 
In that case, the select index plays the role of a join index, 
thereby generating a significant gain on large relations and 
large domains. 

Thus, this performance evaluation shows that our approach 
is feasible and that join indices are mandatory in all cases 
while query optimization turns out to be useful only with large 
databases and complex queries. 



6 Transaction management 

Like any data server, a PicoDBMS must enforce the well- 
known transactional ACID properties [8] to guarantee the con- 
sistency of the local data it manages as well as be able to 
participate in distributed transactions. We discuss below these 
properties with respect to a PicoDBMS. 

• Atomicity: local atomicity means that the set of actions 
performed by the PicoDBMS on a transaction's behalf 
is made persistent following the all or nothing scheme. 
Global atomicity: this means that all data servers - includ- 
ing the PicoDBMS - accessed by a distributed transaction 
agree on the same transaction outcome (either commit or
rollback). The distinguishing features of a PicoDBMS re- 
garding atomicity are no demarcation between main mem- 
ory and persistent storage, the dramatic cost of writes, and 
the fact that they cannot be deferred. 

• Consistency: this property ensures that the actions per- 
formed by the PicoDBMS satisfy all integrity constraints 
defined on the local data. Considering that traditional in- 
tegrity constraint management can be used, we do not dis- 
cuss it any further. 

• Isolation: this property guarantees the serializability of 
concurrent executions. A PicoDBMS manages personal 
data and is typically single-user 8 . Furthermore, smartcard 
operating systems do not even support multithreading. 
Therefore, isolation is useless here. 

• Durability: durability means that committed updates are 
never lost whatever the situation (i.e., even in case of a 
media failure). Durability cannot be enforced locally by 
the PicoDBMS because the smartcard is more likely to be 
stolen, lost or destroyed than a traditional computer. In- 
deed, mobility and smallness play against safety. Conse- 
quently, durability must be enforced through the network. 
The major issue is then preserving the privacy of data while 
delegating the durability to an external agent. 

The remainder of this section addresses local atomicity, 
global atomicity, and durability. 

6.1 Local atomicity

There are basically two ways to perform updates in a DBMS. 
The updates are either performed on shadow objects that are 
atomically integrated in the database at commit time or done 
in place (i.e., the transaction updates the shared copy of the 
database objects) [8]. We discuss these two traditional models 
below. 

• Shadow update: This model is rarely employed in disk- 
based DBMSs because it destroys data locality on disk 
and increases concurrent updates on the catalog. In a Pi- 
coDBMS, disk locality and concurrency are not a concern. 
This model has been shown to be convenient for smart- 
cards equipped with a small Flash memory [25]. However, 
it is poorly adapted to pointer-based storage models like 
RS since the object location changes at every update. In 
addition, the cost incurred by shadowing grows with the 
memory size. Indeed, either the granularity of the shadow 
objects increases or the paths to be duplicated in the cata- 
log become longer. In both cases, the writing cost - which 
is the dominant factor - increases. 

• Update in-place: write-ahead logging (WAL) [8] is re- 
quired in this model to undo the effects of an aborted 
transaction. Unfortunately, the relative cost of WAL is 
much higher in a PicoDBMS than in a traditional disk- 
based DBMS which uses buffering to minimize I/Os. In a 
smartcard, the log must be written for each update since 
each update becomes immediately persistent. This roughly 
doubles the cost of writing. 

8 Even if the data managed by the PicoDBMS are shared among 
multiple users (e.g., as in the healthcare application), the PicoDBMS 
serves a single user at a time. 



Despite its drawbacks, update in-place is better suited than 
shadow update for a PicoDBMS because it accommodates 
pointer-based storage models and its cost is insensitive to the 
rapid growth of stable memory capacity. We also propose two 
optimizations to update in-place: 

• Pointer-based logging: traditional WAL logs the values of 
all modified data. RS allows a finer granularity by logging 
pointers in place of values. The smaller the log records,
the cheaper the WAL. The logging process must consider
two types of information: 

• Values: in case of a tuple update, the log record must con- 
tain the tuple address and the old attribute values, that is a 
pointer for all RS stored attributes and a regular value for 
FS stored attributes. In case of a tuple insertion or deletion, 
assuming each tuple header contains a status bit (i.e., dead 
or alive), only the tuple address has to be logged in order 
to recover its state. 

• Rings: tuple insertion, deletion, and update (of a ring at- 
tribute) modify the structure of each ring traversing the 
corresponding tuple t. Since a ring is a circular chain of 
pointers, recovering its state means recovering the next 
pointer of t's predecessor (let us call it t_pred). The information
to restore in t_pred.next is either t's address if t has
been updated or deleted, or t.next if t has been inserted. t's
address already belongs to the log (see above) and t.next
does not have to be logged since t's content still exists in
stable storage at recovery time. The issue is how to identify
t_pred at recovery time. Logging this information can
be saved at the price of traversing the whole ring starting
from t, until reaching t again. Thus, ring recovery comes
for free in terms of logging. 

• Garbage-collecting values: insertion and deletion of do- 
main values (domain values are never modified) should 
be logged as any other updates. This overhead can be 
avoided by implementing a deferred garbage collector that 
destroys all domain values no longer referenced by any tuple.
Garbage-collecting a domain amounts to executing an
ad hoc semi-join operator between the domain and all relations
varying on it, which discards the domain values that
do not match 9 . The benefit of this solution is threefold: (i)
the lazy deletion of unreferenced values does not compromise the
storage model coherency; (ii) garbage-collecting domain
values is required anyway by RS (even in the absence of 
transaction control); and (iii) a deferred garbage-collector 
can be implemented without reference counters, thereby 
saving storage space. The deferred garbage collector can- 
not work in the background since smartcards do not yet 
support multi-threading. The most pragmatic solution is to 
launch it manually when the card is nearly full. An alter- 
native to this manual procedure is to execute the garbage 
collector automatically at each card connection on a very 
small subset of the database (so that its cost remains hid- 
den to the user). Garbage-collecting the database in such 
an incremental way is straightforward since domain values 
are examined one after the other. 
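
The incremental garbage collection described above can be sketched as follows. This is only an illustration under stated assumptions, not the PicoDBMS implementation: the dictionary-based domain, the function name collect_step, and the budget parameter (how many domain values are examined per card connection) are all hypothetical.

```python
def collect_step(domain, relations, attr, after=None, budget=2):
    """Examine at most `budget` domain values and drop the unreferenced ones.

    domain    -- dict mapping a value id to a value (stand-in for a stored domain)
    relations -- list of relations; each relation is a list of tuples (dicts)
                 whose `attr` entry holds the id of a domain value
    after     -- id of the last value examined by the previous step, or None
    Returns the id to pass as `after` next time, or None once the pass is done.
    """
    # Sorting is only a convenient way to fix an examination order in this
    # sketch; a card would follow the physical order of the domain instead.
    pending = sorted(v for v in domain if after is None or v > after)
    batch = pending[:budget]
    for vid in batch:
        # Semi-join test: keep the value if at least one tuple references it.
        if not any(t[attr] == vid for rel in relations for t in rel):
            del domain[vid]      # no reference counter is needed
    return batch[-1] if len(pending) > budget else None


drugs = {1: "aspirin", 2: "codeine", 3: "ibuprofen", 4: "zinc"}
prescriptions = [{"drug": 1}, {"drug": 3}]
cursor = collect_step(drugs, [prescriptions], "drug")          # examines ids 1 and 2
cursor = collect_step(drugs, [prescriptions], "drug", cursor)  # examines ids 3 and 4
```

Each call touches only a small slice of the domain, which mirrors the idea of running the collector at each card connection so that its cost stays hidden from the user.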



9 Unlike reachability algorithms that start from the persistent roots 
and need marking [6], the proposed garbage-collector starts from the 
persistent leaves (i.e., the domain values) and exploits them one after 
the other, in a pipelined fashion (thus, it conforms to the RAM rule). 
The update in-place model along with pointer-based logging
and deferred garbage-collector reduces logging cost to its
lowest bound, that is, a tuple address for inserted and deleted 
tuples, and the values of updated attributes (again, a pointer 
for DS and RS stored attributes). 
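
To make the ring recovery argument of Sect. 6.1 concrete, here is a minimal sketch of undoing a tuple insertion when only the tuple address was logged. It covers the insertion case only, and the Node class and function names are assumptions introduced for the example; they do not come from the PicoDBMS code.

```python
class Node:
    """A tuple participating in one ring (hypothetical in-memory stand-in)."""
    def __init__(self, name):
        self.name = name
        self.next = self              # a ring of one element points to itself


def find_pred(t):
    """Walk the ring from t until reaching the node whose next pointer is t."""
    node = t
    while node.next is not t:
        node = node.next
    return node


def undo_insert(t):
    """Unlink t from its ring; only t's address is taken from the log."""
    pred = find_pred(t)
    pred.next = t.next                # t.next is still readable at recovery time
    t.next = t


def ring_names(start):
    """List the names met while following the ring once from `start`."""
    names, node = [start.name], start.next
    while node is not start:
        names.append(node.name)
        node = node.next
    return names


a, b, c, t = Node("a"), Node("b"), Node("c"), Node("t")
a.next, b.next, c.next = b, c, a      # ring a -> b -> c -> a
a.next, t.next = t, b                 # insert t between a and b
undo_insert(t)                        # recovery: the ring is a -> b -> c again
```

The predecessor is never logged; it is rediscovered by one traversal of the ring, which trades recovery time for a smaller log.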



6.2 Global atomicity 

Global atomicity is traditionally enforced by an atomic com- 
mitment protocol (ACP). The most well known and widely 
used ACP is 2PC [8]. While extensively studied [19] and stan- 
dardized [21, 29, 41], 2PC suffers from the following weak- 
nesses in our context: 

• Need for a standard prepared state: any server must externalize
the standard XA interface [41] to participate in
2PC. Unfortunately, although ISO defines a transactional interface
for smartcards, it does not cover distributed transactions
[24]. In addition, participating in 2PC requires building a
local prepared state that consumes valuable resources. 

• Disconnection means aborting: a smartcard can be ex- 
tracted from its terminal or its mobile host (e.g., a cellular 
phone) can be temporarily unreachable during 2PC. A par- 
ticipant's disconnection leads 2PC to abort the transaction 
even if all its operations have been successfully executed. 

• Badly adapted to moving participants: the 2PC incurs two 
message rounds to commit a transaction. Considering the 
high cost of wireless communication, the overhead is sig- 
nificant for mobile terminals equipped with a smartcard 
reader (e.g., PDA, cellular phones). 

As its name indicates, 2PC has two phases: the voting phase
and the decision phase. The voting phase is the means by 
which the coordinator checks whether or not the participants 
can locally guarantee the ACID properties of the distributed 
transaction. The decision is commit if all participants vote yes 
and abort otherwise. Thus, the voting phase introduces an 
uncertainty period at transaction termination that leads to the 
aforementioned drawbacks. 

Variations of one-phase commit protocols (1PC) have been
recently proposed [2, 4, 35]. As stated in [2], 1PC eliminates
the voting phase of 2PC by enforcing the following properties
on the participant's behavior: (1) all operations are acknowledged
before the 1PC is launched; (2) there are no deferred
integrity constraints; (3) all participants are ruled by a rigorous
concurrency control scheduler; and (4) all updates are logged
on stable storage before 1PC is launched. These assumptions
guarantee, respectively, the A, C, I, D properties before the 
ACP is launched. Then, the ACP reduces to a single phase, that 
is broadcasting the coordinator's decision to all participants 
(this decision is commit if all transaction's operations have 
been successfully executed and abort otherwise). If a crash 
or a disconnection precludes a participant from conforming 
to this decision, the corresponding transaction branch is sim- 
ply forward recovered (potentially at the next reconnection). 
While the assumptions on the participant's behavior seem con- 
straining in the general case, they are quite acceptable in the 
smartcard context [10]. Property (1) is common to all ACPs
and is enforced by the ISO 7816 standard [22]; property (2)
conforms to the fact that PicoDBMS have lighter capabilities 



than full-fledged DBMS; and property (3) is satisfied by def- 
inition since smartcards do not support parallel executions. 
Property (4) is discussed in Sect. 6.3. 
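
The single-phase behavior described above can be illustrated with a small sketch. It is a simplification under stated assumptions: the Participant and Coordinator classes and their methods are hypothetical, the coordinator log is a plain list rather than stable storage, and forward recovery is reduced to delivering the missed decision (a fuller version would replay the logged operations of the branch).

```python
class Participant:
    """Hypothetical data manager (e.g., a smartcard) following the 1PC rules."""
    def __init__(self, name):
        self.name = name
        self.connected = True
        self.outcome = {}                     # transaction id -> decision

    def execute(self, op):
        return True                           # property (1): the operation is acknowledged

    def decide(self, tid, decision):
        if not self.connected:
            raise ConnectionError(self.name)
        self.outcome[tid] = decision


class Coordinator:
    def __init__(self):
        self.log = []                         # stand-in for the coordinator log
        self.pending = []                     # decisions that could not be delivered

    def run(self, tid, work):
        """work: list of (participant, operation) pairs."""
        ok = True
        for part, op in work:
            self.log.append((tid, part.name, op))   # property (4): log before the ACP
            ok = part.execute(op) and ok
        decision = "commit" if ok else "abort"
        for part in {p for p, _ in work}:     # single phase: broadcast the decision
            self.deliver(tid, part, decision)
        return decision

    def deliver(self, tid, part, decision):
        try:
            part.decide(tid, decision)
        except ConnectionError:
            self.pending.append((tid, part, decision))

    def reconnect(self, part):
        """Hand the missed decisions to a participant that is reachable again."""
        part.connected = True
        remaining = []
        for tid, p, decision in self.pending:
            if p is part:
                p.decide(tid, decision)
            else:
                remaining.append((tid, p, decision))
        self.pending = remaining
```

Note that there is no vote: a disconnection after the operations were acknowledged delays the delivery of the decision but does not force an abort.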

Eliminating the voting phase of the ACP solves altogether 
the three aforementioned problems. However, one may won- 
der about the interoperability between transaction managers 
and data managers supporting different protocols (either 1PC
or 2PC). We have shown in [1] that the participation of legacy
(i.e., 2PC compliant) data managers in 1PC is straightforward.
Conversely, the participation of 1PC compliant data managers
(e.g., a smartcard) in the 2PC can be achieved by associating
a log agent to each participant. The role of the log agent
is twofold. First, it manages the data manager's part of the
1PC's coordinator log, forces it to stable storage during the
2PC prepare phase, and exploits it if the transaction branch
needs to be forward-recovered. Second, it translates the 2PC
interface into that of 1PC. The log agent can be located on the
terminal, so that the benefit of 1PC is lost for the terminal but
it is preserved for the smartcard. 



6.3 Durability 

Most 1PC protocols assume that the coordinator is in charge
of logging all participants' updates before triggering the ACP 
(all these protocols belong to the coordinator log family). Co- 
ordinator log [35] and implicit yes vote [4] assume that the 
participants piggyback their log records on the acknowledg- 
ment messages of each operation while coordinator logical 
log [2] assumes that the coordinator logs all operations sent to 
each participant. In all cases, the durability of the distributed 
transaction relies on the coordinator log. Thus, 1PC is a means
by which global atomicity and durability can be solved together,
at the same price.

Two issues remain to be solved: (i) where to store the 
coordinator log; and (ii) how to preserve the security rule, 
that is, how to make the log content as secure as the data 
stored in the smartcard. Since the log must sustain any kind 
of failure, it must be stored on the network by a trustee server 
(e.g., a public organism, a central bank, the card issuer, etc.). 
If some transactions are executed in disconnected mode (e.g., 
on a mobile terminal), the durability will be effective only at 
the time the terminal reconnects to the network. Protecting 
the log content against attacks imposes encryption. The way 
encryption is performed depends on the model of logging. 
If the coordinator log is fed by the log records piggybacked 
by the participants, the smartcard can encrypt them with an 
algorithm based on a private key (e.g., DES [28]). Otherwise 
(i.e., if the coordinator logical log scheme is selected), the 
smartcard can provide the coordinator with a public key that 
will be used by the coordinator itself to encrypt its log [32]. 



6.4 Transaction cost evaluation 

The goal of this section is to approximate the time required by 
a representative update transaction. The objective is to confirm 
whether or not the write performance of smartcards assumed 
in this paper is acceptable for database applications like health 
cards. To this end, we estimate the time required to create a 
tuple in a relation, including the creation of domain values, 
the insertion of the tuple in the rings potentially defined on
this relation and the log time. Let us introduce the following 
parameters, in addition to those already defined in Sect. 4.4: 

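• nbAttFS: number of FS stored attributes

• nbAttDS: number of DS stored attributes

• nbAttRS: number of RS stored attributes

• w: size of a word (4 bytes in a 32-bit card)

• t: time to write one word in stable storage (5 ms in the worst case)

\[
\begin{aligned}
\mathit{Cost}(\mathit{insertTuple}) = \Big(\; & \lceil (\mathit{nbAttFS} \cdot a + \mathit{nbAttDS} \cdot p + \mathit{nbAttRS} \cdot p)/w \rceil && (1)\\
{}+{} & (\mathit{nbAttRS} + \mathit{nbAttDS}) \cdot S \cdot \lceil a/w \rceil && (2)\\
{}+{} & \mathit{nbAttRS} \cdot \lceil p/w \rceil && (3)\\
{}+{} & \lceil p/w \rceil \;\Big) && (4)\\
{}\cdot{} & t && (5)
\end{aligned}
\]

(1) Tuple size
(2) Domain values size; S = probability to create a new domain value
(3) Ring pointers to be updated
(4) Log record size
(5) Write time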
© Write time 

Let us consider a representative transaction executed on
the healthcare database. This transaction inserts a new tuple in Doctor
and Visit and five tuples in Prescription and Drug. This is
somewhat of a worst case for this application in the sense that the
visited doctor is a new one and prescribes five new drugs. The 
considered attribute distribution is as follows: 

Doctor (nbAttFS=3, nbAttDS=4, nbAttRS=0), 

Visit (nbAttFS=2, nbAttDS=3, nbAttRS=2), 

Prescription (nbAttFS=1, nbAttDS=1, nbAttRS=2),

Drug (nbAttFS=2, nbAttDS=4, nbAttRS=0). 

The average attribute length a is fixed to 10 bytes. Figure 8 
plots the update transaction execution time depending on S 
(S = 0 means that all attribute values already exist in the
domains, while S = 1 means that all these values need be
inserted in the domains). 

The figure is self-explanatory. Note that the logging cost 
represents less than 3% of the total cost. This simple analysis 
shows that the time expected for this kind of transaction (less 
than 1 s) is clearly compatible with the healthcare application's 
requirements. 
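
The cost formula of this section can be evaluated with a few lines of code. The sketch below is a direct transcription under two assumptions that are not stated in this section: the bracketed terms are read as ceilings (writes are done by whole words), and the pointer size p, defined in Sect. 4.4, is taken to be 4 bytes. Because of these assumptions the absolute values it yields need not coincide with those plotted in Figure 8; the function and constant names are hypothetical.

```python
from math import ceil


def insert_cost_words(nb_fs, nb_ds, nb_rs, S, a=10, p=4, w=4):
    """Words written to stable storage when one tuple is inserted."""
    tuple_size = ceil((nb_fs * a + nb_ds * p + nb_rs * p) / w)   # (1)
    domain_values = (nb_rs + nb_ds) * S * ceil(a / w)            # (2)
    ring_pointers = nb_rs * ceil(p / w)                          # (3)
    log_record = ceil(p / w)                                     # (4)
    return tuple_size + domain_values + ring_pointers + log_record


# (number of tuples, nbAttFS, nbAttDS, nbAttRS) for the representative transaction
TRANSACTION = [
    (1, 3, 4, 0),   # Doctor
    (1, 2, 3, 2),   # Visit
    (5, 1, 1, 2),   # Prescription
    (5, 2, 4, 0),   # Drug
]


def transaction_cost_ms(S, t_ms=5, p=4):
    """Total write time of the transaction, in milliseconds."""
    words = sum(n * insert_cost_words(fs, ds, rs, S, p=p)
                for n, fs, ds, rs in TRANSACTION)
    return words * t_ms                                          # (5)
```

Calling transaction_cost_ms for several values of S between 0 and 1 gives the shape of the curve: the cost grows linearly with S, and the log term (4) contributes a single word per inserted tuple.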



7 Conclusion 

As smartcards become more and more versatile, multi- 
application, and powerful, the need for database techniques 
arises. However, smartcards have severe hardware limita- 
tions which make traditional database technology irrelevant. 
The major problem is scaling down database techniques so 
they perform well under these limitations. In this paper, we 
addressed this problem and proposed the design of a Pi- 
coDBMS, concentrating on the components which require 
non-traditional techniques (storage manager, query manager, 
and transaction manager). 

This paper makes several contributions. First, we an- 
alyzed the requirements for a PicoDBMS based on a 
healthcare application which is representative of personal 
folder applications and has strong database requirements.
We showed that the minimal functionality should include 
select/project/join/aggregate, access right management, and 
views as well as transaction's atomicity and durability. 

Second, we gave an in-depth analysis of the problem by 
considering the smartcard hardware trends. Based on this anal- 
ysis, we assumed a smartcard with a reasonable stable memory 
of a few megabytes and a small RAM of some kilobytes, and 
we derived design rules for a PicoDBMS architecture. 

Third, we proposed a new highly compact storage model 
that combines flat storage (FS), domain storage (DS), and ring 
storage (RS). Ring storage reduces the indexing cost to its 
lowest bound. Based on performance evaluation, we derived 
guidelines to decide the best way to store an attribute. 

Fourth, we proposed query processing techniques which 
handle complex query plans with no RAM consumption. This 
is achieved by considering extreme right-deep trees which can 
pipeline all operators of the plan including aggregates. We 
also argued that, if query optimization is needed, the strategy 
should be exhaustive search. We measured the performance of 
our execution model with an implementation of our query en- 
gine on two old-fashioned computers which we configured to 
be similar to forthcoming smartcard architectures. We showed 
that the resulting performance matches the smartcard applica- 
tion's requirements. 

Finally, we proposed techniques for transaction atomic- 
ity and durability. Local atomicity is achieved through up- 
date in-place with two optimizations which exploit the stor- 
age model: pointer-based logging and garbage collection of 
domain values. Global atomicity and durability are enforced 
by 1PC, which is easily applicable in the smartcard context
and more efficient than 2PC. We showed that the performance 
of typical update transactions is acceptable for representative 
applications like the health card. 

This work is done in the context of a new project with 
Bull Smart Cards and Terminals. The next step is to port our 
PicoDBMS prototype to Bull's new smartcard technology,
called OverSoft [12], and to assess its functionality and per- 
formance on real-world applications. To this end, a bench- 
mark dedicated to PicoDBMS must be set up. We also plan to 
address open issues such as protected logging for durability, 
query execution on encrypted data (e.g., stored in an external 
Flash), and statistics maintenance on a population of cards. 
The coalesced hashing method is one of the faster
searching methods known today. This paper is a practical 
study of coalesced hashing for use by those who intend 
to implement or further study the algorithm. Techniques 
are developed for tuning an important parameter that 
relates the sizes of the address region and the cellar in 
order to optimize the average running times of different 
implementations. A value for the parameter is reported 
that works well in most cases. Detailed graphs explain
how the parameter can be tuned further to meet specific 
needs. The resulting tuned algorithm outperforms several 
well-known methods including standard coalesced hash- 
ing, separate (or direct) chaining, linear probing, and 
double hashing. A variety of related methods are also 
analyzed, including deletion algorithms, a new and im- 
proved insertion strategy called varied-insertion, and ap- 
plications to external searching on secondary storage 
devices. 
One of the primary uses today for computer technol-
ogy is information storage and retrieval. Typical search- 
ing applications include dictionaries, telephone listings, 
medical databases, symbol tables for compilers, and 
storing a company's business records. Each package of 
information is stored in computer memory as a record. 
We assume there is a special field in each record, called 
the key, that uniquely identifies it. The job of a searching 
algorithm is to take an input K and return the record (if 
any) that has K as its key. 

Hashing is a widely used searching technique because 
no matter how many records are stored, the average 
search times remain bounded. The common element of 
all hashing algorithms is a predefined and quickly com- 
puted hash function 

hash: {all possible keys} → {1, 2, ..., M}

that assigns each record to a hash address in a uniform 
manner. (The problem of designing hash functions that 
justify this assumption, even when the distribution of the 
keys is highly biased, is well-studied [7, 2].) Hashing 
methods differ from one another by how they resolve a 
collision when the hash address of the record to be 
inserted is already occupied. 

This paper investigates the coalesced hashing algo- 
rithm, which was first published 22 years ago and is still 
one of the faster known searching methods [16, 7]. The 
total number of available storage locations is assumed to 
be fixed. It is also convenient to assume that these 
locations are contiguous in memory. For the purpose of 
notation, we shall number the hash table slots 1, 2, ...,
M'. The first M slots, which serve as the range of the
hash function, constitute the address region. The remaining
M' - M slots are devoted solely to storing records
that collide when inserted; they are called the cellar. 
Once the cellar becomes full, subsequent colliders must 
be stored in empty slots in the address region and, thus, 
may trigger more collisions with records inserted later. 

For this reason, the search performance of the coa- 
lesced hashing algorithm is very sensitive to the relative 
sizes of the address region and cellar. In Sec. 4, we apply 
the analytic results derived in [10, 11, 13] in order to 
optimize the ratio of their sizes, β = M/M', which we
call the address factor. The optimizations are based on
two performance measures: the number of probes per
search and the running time of assembly language versions.
There is no unique best choice for β (the optimum
address factor depends on the type of search, the number
of inserted records, and the performance measure chosen),
but we shall see that the compromise choice β ≈
0.86 works well in many situations. The method can be
further tuned to meet specific needs.

Section 5 shows that this tuned method dominates 
several popular hashing algorithms including standard 
coalesced hashing (in which β = 1), separate (or direct)
chaining, linear probing, and double hashing. The last
three sections deal with variations and different imple- 
mentations for coalesced hashing including deletion al- 
gorithms, alternative insertion methods, and external 
searching on secondary storage devices. 

This paper is designed to provide a comprehensive 
treatment of the many practical issues concerned with 
the implementation of the coalesced hashing method. 
Readers interested in the theoretical justification of the 
results in this paper can consult [10, 11, 13, 14, 1].



2. The Coalesced Hashing Algorithm 

The algorithm works like this: Given a record with 
key K, the algorithm searches for it in the hash table,
starting at location hash(K) and following the links in 
the chain. If the record is present in the table, then it is 
found and the search is successful; otherwise, the end of 
the chain is reached and the search is unsuccessful. For 
simplicity, we assume that the record is inserted when- 
ever the search ends unsuccessfully, according to the 
following rule: If position hash(K) is empty, then the 
record is stored at that location; else, it is placed in the 
largest-numbered empty slot in the table and is linked to 
the end of the chain. This has the effect of putting the 
first M' - M colliders into the cellar.

Coalesced hashing is a generalization of the well- 
known separate (or direct) chaining method. The sepa- 
rate chaining method halts with overflow when there is 
no more room in the cellar to store a collider. The 
example in Fig. 1(a) can be considered to be an example 



of both coalesced hashing and separate chaining, because 
the cellar is large enough to store the three colliders. 

Figures 1(b) and 1(c) show how the two methods 
differ. The cellar contains only one slot in the example 
in Fig. 1(b). When the key mark collides with donna at 
slot 4, the cellar is already full. Separate chaining would 
report overflow at this point. The coalesced hashing
method, however, stores the key mark in the largest- 
numbered empty space (which is location 10 in the 
address region). This causes a later collision when dave 
hashes to position 10, so dave is placed in slot 8 at the 
end of the chain containing donna and mark. The 
method derives its name from this "coalescing" of rec- 
ords with different hash addresses into single chains. 

The average number of probes per search shows 
marked improvement in Fig. 1(b), even though coalesc- 
ing has occurred. Intuitively, the larger address region 
spreads out the records more evenly and causes fewer 
collisions, i.e., the hash function can be thought of as 
"shooting" at a bigger target. The cellar is now too small 
to store these fewer colliders, so it overflows. Fortunately, 
this overflow occurs late in the game, and the pileup 
phenomenon of coalescing is not significant enough to 
counteract the benefits of a larger address region. However,
in the extreme case when M = M' = 11 and there
is no cellar (which we call standard coalesced hashing), 
coalescing begins too early and search time worsens (as 
typified by Figure 1(c)). Determining the optimum address
factor β = M/M' is a major focus of this paper.

The first order of business before we can start a 
detailed study of the coalesced hashing method is to 
formalize the algorithm and to define reasonable 
measures of search performance. Let us assume that each 



Fig. 1. Coalesced hashing, M' = 11, N = 8. The sizes of the address region are (a) M = 8, (b) M = 10, and (c) M = 11.
of the M' contiguous slots in the coalesced hash table
has the following organization: 
For each value of i between 1 and M', EMPTY[i] is a
one-bit field that denotes whether the ith slot is unused,
KEY[i] stores the key (if any), and LINK[i] is either the
index to the next spot in the chain or else the null value 
0. 

The algorithms in this article are written in the 
English-like style used by Knuth in order to make them 
readily understandable to all and to facilitate compari- 
sons with the algorithms contained in [7, 4, 12]. Block- 
structured languages, like PL/I and Pascal, are good for 
expressing complicated program modules; however, they 
are not used here, because hashing algorithms are so 
short that there is no reason to discriminate against those 
who are not comfortable with such languages. 

Algorithm C (Coalesced hashing search and insertion). 
This algorithm searches an M'-slot hash table, looking
for a given key K. If the search is unsuccessful and the
table is not full, then K is inserted.

The size of the address region is M; the hash function
hash returns a value between 1 and M (inclusive). For
convenience, we make use of slot 0, which is always
empty. The global variable R is used to find an empty
space whenever a collision must be stored in the table.
Initially, the table is empty, and we have R = M' + 1;
when an empty space is requested, R is decremented
until one is found. We assume that the following initializations
have been made before any searches or insertions
are performed: M ← ⌈βM'⌉, for some constant
0 < β ≤ 1; EMPTY[i] ← true, for all 0 ≤ i ≤ M'; and
R ← M' + 1.

C1. [Hash.] Set i ← hash(K). (Now 1 ≤ i ≤ M.)

C2. [Is there a chain?] If EMPTY[i], then go to step C6.
(Otherwise, the ith slot is occupied, so we will look
at the chain of records that starts there.)

C3. [Compare.] If K = KEY[i], the algorithm terminates
successfully.

C4. [Advance to next record.] If LINK[i] ≠ 0, then set
i ← LINK[i] and go back to step C3.

C5. [Find empty slot.] (The search for K in the chain
was unsuccessful, so we will try to find an empty
table slot to store K.) Decrease R one or more times
until EMPTY[R] becomes true. If R = 0, then there
are no more empty slots, and the algorithm
terminates with overflow. Otherwise, append the Rth cell
to the chain by setting LINK[i] ← R; then set i ←
R.

C6. [Insert new record.] Set EMPTY[i] ← false, KEY[i]
← K, LINK[i] ← 0, and initialize the other fields in
the record. ■
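Steps C1 through C6 can be sketched in a high-level language as follows. This is an illustrative sketch, not the article's own code: the class name, the method name search_or_insert, and the default hash function are assumptions made for the example, and Python lists stand in for the EMPTY, KEY, and LINK fields.

```python
import math


class CoalescedHashTable:
    """Sketch of Algorithm C: slots 1..M', address region 1..M, cellar M+1..M'."""

    def __init__(self, m_prime, beta=0.86, hash_fn=None):
        self.m_prime = m_prime
        self.m = math.ceil(beta * m_prime)        # M <- ceil(beta * M')
        self.empty = [True] * (m_prime + 1)       # slot 0 is always empty
        self.key = [None] * (m_prime + 1)
        self.link = [0] * (m_prime + 1)           # 0 is the null link
        self.r = m_prime + 1                      # R <- M' + 1
        # Assumed default: any function returning a value in 1..M will do.
        self.hash_fn = hash_fn or (lambda k: hash(k) % self.m + 1)

    def search_or_insert(self, k):
        """Return (slot, found); insert k if absent. Raises OverflowError if full."""
        i = self.hash_fn(k)                       # C1
        if not self.empty[i]:                     # C2
            while True:
                if self.key[i] == k:              # C3
                    return i, True
                if self.link[i] == 0:             # C4: end of chain
                    break
                i = self.link[i]
            # C5: decrease R until an empty slot is found (slot 0 is a sentinel).
            while self.r > 0:
                self.r -= 1
                if self.empty[self.r]:
                    break
            if self.r == 0:
                raise OverflowError("hash table is full")
            self.link[i] = self.r
            i = self.r
        self.empty[i] = False                     # C6
        self.key[i] = k
        self.link[i] = 0
        return i, False
```

With M' = 10, β = 0.5, and the toy hash function k mod 5 + 1, the keys 5, 10, and 15 all hash to slot 1; the second and third are stored in cellar slots 10 and 9 and linked into the chain that starts at slot 1.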



In this paper, we concern ourselves with measuring
the searching phase of Algorithm C and ignore for the
most part the insertion time in steps C5 and C6. (The
time for step C5 is not significant, because the total
number of times R is decremented over the course of all
the insertions cannot be more than the number of
inserted records; hence, the amortized expected number of
decrements is at most 1. The decrementing operation can
also be done in parallel with steps C1-C4.) Our primary
measure of search performance is the number of probes
per search, which is the number of different table slots
that are accessed while searching. In Algorithm C, this
quantity is equal to

max{1, number of times step C3 is performed}

For example, in Fig. 1(b), the unsuccessful searches for 
keys a.l. and tootie (immediately prior to their inser- 
tions) each took one probe, while a successful search for 
dave would take two probes. 

The average performance of the algorithm is ob- 
tained by assuming that all searches and insertions are 
random. The Appendix contains a discussion of the 
probability model as well as the formulas for the ex- 
pected number of probes in unsuccessful and successful 
searches. 

3. Assembly Language Implementation 

Even though probe-counting gives us a good idea of
search performance, other factors (such as the complexity
of the search loop and the overhead in computing the
hash address) also affect the running time when
Algorithm C is programmed for a real computer. For
completeness, we optimize the running time of assembly
language versions of coalesced hashing.

We choose to program in assembly language rather
than in some high-level language like Fortran, PL/I, or
Pascal, in order to achieve maximum possible efficiency.
Top efficiency is important in large-scale applications of
hashing, but it can also be achieved in smaller systems
with little extra effort, because hashing algorithms are so
short that implementing them (even in assembly
language) is easy. We use a hypothetical language based on
Knuth's MIX [6] because its features are similar to most
well-known machines and its inherent simplicity allows
us to write programs in clear and concise form.

Program C below is a MIX-like implementation of
Algorithm C. Liberties have been taken with the
language for purposes of clarity; the actual MIX code appears
in [10]. The program is written in a five-column format:
the first column gives the line numbers, the second
column lists the instruction labels, the third column
contains the assembly language instructions, the fourth
column counts the number of times the instructions are
executed, and the last column is for comments that
explain what the instructions do. The syntax of the
commands should be clear to those familiar with
assembly language programming. The four registers
used in Program C are named rA, rX, rI, and rJ. The
reference KEY(I) denotes the contents of the memory
location whose address is the value of KEY plus the
contents of rI. (This is KEY[i] in the notation of
Algorithm C.)

Program C (Coalesced hashing search and insertion).
This program follows the conventions of Algorithm C,
except that the EMPTY field is implicit in the LINK
field: empty slots are marked by a −1 in the LINK field
of that slot. Null links are denoted by a 0 in the LINK
field. The variable R and the key K are stored in memory
locations R and K. Registers rI and rA are used to store
the values of i and K. Register rJ stores either the value
of LINK[i] or R. The instruction labels SUCCESS and
OVERFLOW are for exiting and are assumed to lie
somewhere outside this code.

01  START  LD   X, K          1            Step C1. Load rX with K.
02         ENT  A, 0          1            Enter 0 into rA.
03         DIV  =M=           1            rA ← ⌊K/M⌋, rX ← K mod M.
04         ENT  I, X          1            Enter rX into rI.
05         INC  I, 1          1            Increment rI by 1.
06         LD   A, K          1            Load rA with K.
07         LD   J, LINK(I)    1            Step C2. Load rJ with LINK[i].
08         JN   J, STEP6      1            Jump to STEP6 if LINK[i] < 0.
09         CMP  A, KEY(I)     A            Step C3. Compare K with KEY[i].
10         JE   SUCCESS       A            Exit (successfully) if K = KEY[i].
11         JZ   J, STEP5      A − S1       Jump to STEP5 if LINK[i] = 0.
12  STEP4  ENT  I, J          C − 1        Step C4. Enter rJ into rI.
13         CMP  A, KEY(I)     C − 1        Step C3. Compare K with KEY[i].
14         JE   SUCCESS       C − 1        Exit (successfully) if K = KEY[i].
15         LD   J, LINK(I)    C − 1 − S2   Load rJ with LINK[i].
16         JNZ  J, STEP4      C − 1 − S2   Jump to STEP4 if LINK[i] ≠ 0.
17  STEP5  LD   J, R          A − S        Step C5. Load rJ with R.
18         DEC  J, 1          T            Decrement R by 1.
19         LD   X, LINK(J)    T            Load rX with LINK[R].
20         JNN  X, *-2        T            Go back two steps if LINK[R] ≥ 0.
21         JZ   J, OVERFLOW   A − S        Exit (with overflow) if R = 0.
22         ST   J, LINK(I)    A − S        Store R in LINK[i].
23         ENT  I, J          A − S        Enter rJ into rI.
24         ST   J, R          A − S        Update R in memory.
25  STEP6  ST   0, LINK(I)    1 − S        Step C6. Store 0 in LINK[i].
26         ST   A, KEY(I)     1 − S        Store K in KEY[i]. ■



The execution time is measured in MIX units of time,
which we denote u. The number of time units required
by an instruction is equal to the number of memory
references (including the reference to the instruction
itself). Hence, the LD, ST, and CMP instructions each
take two units of time, while ENT, INC, DEC, and the
jump instructions require only one time unit. The
division operation used to compute the hash address is an
exception to this rule; it takes 14u to execute.

The running time of a MIX program is the weighted
sum

$$\sum_{\substack{\text{each instruction}\\ \text{in the program}}} \bigl(\text{\# times the instruction is executed}\bigr)\,\bigl(\text{\# time units required by the instruction}\bigr) \tag{1}$$

This is a somewhat simplistic model, since it does not 
make use of cache or buffered memory for fast access of 
frequently used data, and since it ignores any interven- 
tion by the operating system. But it places all hashing 
algorithms on an equal footing and gives a good indi- 
cation of relative merit. 



The fourth column of Program C expresses the
number of times each instruction is executed in terms of the
quantities

C = number of probes per search.
A = 1 if the initial probe found an occupied slot, 0 otherwise.
S = 1 if successful, 0 if unsuccessful.
T = number of slots probed while looking for an empty space.

We further decompose S into S1 + S2, where S1 = 1 if
the search is successful on the first probe, and S1 = 0
otherwise. By formula (1), the total running time of the
searching phase is

$$(7C + 4A + 17 - 3S + 2S_1)\,u \tag{2}$$

and the insertion of a new record after an unsuccessful
search (when S = 0) takes an additional (8A + 4T + 4)u.
The average running time is the expected value of (2),
assuming that all insertions and searches are random.
The formula can be obtained by replacing the variables
in Eq. (2) with their expected values.
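As a check on the arithmetic, formula (1) can be applied mechanically to the instruction list of Program C. The sketch below is illustrative only; the function name and the tuple layout are assumptions. It takes the per-instruction costs stated above (2u for LD, ST, and CMP; 1u for ENT, INC, DEC, and jumps; 14u for DIV), uses the execution counts from the fourth column, and assumes that lines 01 through 08 are each executed once per search.

```python
def program_c_time(C, A, S, S1, T):
    """Return (search_time, insertion_time) of Program C in units of u."""
    S2 = S - S1
    LD = ST = CMP = 2            # one memory operand plus the instruction fetch
    ENT = INC = DEC = JUMP = 1   # instruction fetch only
    DIV = 14                     # the stated exception to the rule

    # (cost, execution count) for lines 01-16: the searching phase.
    search = [
        (LD, 1), (ENT, 1), (DIV, 1), (ENT, 1), (INC, 1), (LD, 1),   # 01-06
        (LD, 1), (JUMP, 1),                                         # 07-08
        (CMP, A), (JUMP, A), (JUMP, A - S1),                        # 09-11
        (ENT, C - 1), (CMP, C - 1), (JUMP, C - 1),                  # 12-14
        (LD, C - 1 - S2), (JUMP, C - 1 - S2),                       # 15-16
    ]
    # (cost, execution count) for lines 17-26: the insertion phase.
    insertion = [
        (LD, A - S), (DEC, T), (LD, T), (JUMP, T), (JUMP, A - S),   # 17-21
        (ST, A - S), (ENT, A - S), (ST, A - S),                     # 22-24
        (ST, 1 - S), (ST, 1 - S),                                   # 25-26
    ]
    weighted_sum = lambda rows: sum(cost * count for cost, count in rows)
    return weighted_sum(search), weighted_sum(insertion)
```

For an unsuccessful search with C = 3, A = 1, and T = 2, the first component equals 7·3 + 4·1 + 17 = 42u and the second equals 8·1 + 4·2 + 4 = 20u, in agreement with formula (2) and the insertion cost quoted after it.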
4. Tuning β to Obtain Optimum Performance

The purpose of the analysis in [10, 11, 13] is to show
how the average-case performance of the coalesced
hashing method varies as a function of the address factor β
= M/M' and the load factor α = N/M'. In this section,
for each fixed value of α, we make use of those results in
order to "tune" our choice of β and speed up the search
times. Our two measures of performance are the expected
number of probes per search and the average running
time of assembly language versions. In the latter case,
we study a MIX implementation in detail, and then show
how to apply what we learn to other assembly languages.

Unfortunately, there is no single choice of β that
yields best results: the optimum choice β_opt is a function
of the load factor α, and it is even different for
unsuccessful and successful searches. The section concludes
with practical tips on how to initialize β. In particular,
we shall see that the choice β ≈ 0.86 works well in most
situations.

4.1 Number of Probes Per Search 

For each fixed value of α, we want to find the values
β_opt that minimize the expected number of search probes
in unsuccessful and successful searches. Formulas (A1)
and (A2) in the Appendix express the average number
of probes per search as a function of three variables: the
load factor α = N/M', the address factor β = M/M',
and a new variable λ = L/M, where L is the expected
number of inserted records needed to make the cellar
become full. The variables β and λ are related by the
formula

$$e^{-\lambda} + \lambda = 1/\beta \tag{3}$$

Formulas (A1) and (A2) each have two cases, "α ≤
λβ" and "α ≥ λβ," which have the following intuitive
meanings: The condition α < λβ means that with high
probability not enough records have been inserted to fill
up the cellar, while the condition α > λβ means that
enough records have been inserted to make the cellar
almost surely full.

The optimum address factor β_opt is always located
somewhere in the "α ≥ λβ" region, as shown in the
Appendix. The rest of the optimization procedure is a
straightforward application of differential calculus. First,
we substitute Eq. (3) into the "α ≥ λβ" cases of the
formulas for the expected number of probes per search
in order to express them in terms of only the two
variables α and λ. For each nonzero fixed value of α, the
formulas are convex w.r.t. λ and have unique minima.
We minimize them by setting their derivatives equal to
0. Numerical analysis techniques are used to solve the
resulting equations and to get the optimum values of λ
for several different values of α. Then we reapply Eq. (3)
to express the optimum points in terms of β. The results
are graphed in Fig. 2(a), using spline interpolation to fill
in the gaps.
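Translating between β and λ is the one step of this procedure that needs no knowledge of formulas (A1) and (A2). The sketch below assumes that Eq. (3) reads e^(−λ) + λ = 1/β and solves it for λ by bisection; the function name and the tolerance are assumptions made for the example. Bisection applies because the left-hand side is nondecreasing for λ ≥ 0 and equals 1 at λ = 0.

```python
import math


def cellar_lambda(beta, tol=1e-12):
    """Solve exp(-lam) + lam = 1/beta for lam >= 0 (assumed form of Eq. (3))."""
    if not 0 < beta <= 1:
        raise ValueError("the address factor must satisfy 0 < beta <= 1")
    target = 1.0 / beta
    # f(lam) = exp(-lam) + lam has f(0) = 1 <= target and f(target) > target,
    # so the root is bracketed by [0, target].
    lo, hi = 0.0, target
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.exp(-mid) + mid < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lam = cellar_lambda(0.86)
print(lam, lam * 0.86)   # lambda, and the load factor lambda*beta at which the cellar is expected to fill
```

Under that assumption, the product λβ marks the boundary between the two cases of formulas (A1) and (A2) for a given β.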



4.2 MIX Running Times

Optimizing the MIX execution times could be tricky,
in general, because the formulas might have local as well
as global minima. Then when we set the derivatives
equal to 0 in order to find β_opt, there might be several
roots to the resulting equations. The crucial fact that lets
us apply the same optimization techniques we used above
for the number of probes is that the formulas for the MIX
running times are well-behaved, as shown in the
Appendix. By that we mean that each formula is minimized at
a unique β_opt, which occurs either at the endpoint α =
λβ or at the unique point in the "α ≥ λβ" region where
the derivative w.r.t. β is 0.

The optimization procedure is the same as before.
The expected values of formulas (A4) and (A5), which
give the MIX running times for unsuccessful and
successful searches, are functions of the three variables α, β, and
λ. We substitute Eq. (3) into the expected running times
in order to express β in terms of λ. For several different
load factors α and for each type of search, we find the
value of λ that minimizes the formula, and then we
retranslate this value via Eq. (3) to get β_opt. Figure 2(b)
graphs these optimum values β_opt as a function of α;
spline interpolation was used to fill in the gaps. As in the
previous section, the formulas for the average
unsuccessful and successful search times yield different optimum
address factors. For the successful search case, notice
how closely β_opt agrees with the corresponding values
that minimize the expected number of probes.



Fig. 2. The values β_opt that optimize search performance for the
following three measures: (a) the expected number of probes per
search, (b) the expected running time of Program C, and (c) the
expected assembly language running time for large keys.

4.3 Applying the Results to Other Implementations

Our MIX analysis suggests two important principles
to be used in finding β_opt for a particular
implementation of coalesced hashing. First, the formulas for the
expected number of times each instruction in the
program is executed (which are expressed for Program C in
terms of C, A, S, S1, S2, and T) may have the two cases,
"α ≤ λβ" and "α ≥ λβ," but probably not more.

Second, the same optimization process as above can
be used to find β_opt, because the formulas for the
running times should be well-behaved for the following
reason: The main difference between Program C and
another implementation is likely to be the relative time
it takes to process each key. (The keys are assumed to be
very small in the MIX version.) Thus, the unsuccessful
search time for another implementation might be
approximately

$$[(2k + 5)C + (2k + 2)A + (-2k + 19)]\,u' \tag{4}$$

where u' is the standard unit of time on the other
computer and k is how many times longer it takes to
process a key (multiplied by u/u'). Successful search
times would be about

$$[(2k + 5)C + 18 + 2S_1]\,u' \tag{5}$$

Formulas (4) and (5) were calculated by increasing the
execution times of the key-processing steps 9 and 13 in
Program C by a factor of k. (See formulas (A4) and (A5)
for the k = 1 case.) We ignore the extra time it takes to
load the larger key and to compute the hash function,
since that does not affect the optimization.
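The derivation of formulas (4) and (5) can be replayed numerically. The sketch below is an assumption-laden illustration rather than the article's code: it takes the searching phase of Program C to cost 24 time units for the hash computation and first LINK test, charges 2k instead of 2 for each key comparison (lines 09 and 13), and leaves every other instruction at its MIX cost. The function name is invented for the example.

```python
def scaled_search_time(C, A, S, S1, k=1):
    """Searching time when each key comparison costs 2k units instead of 2."""
    S2 = S - S1
    compare = 2 * k
    hashing = 2 + 1 + 14 + 1 + 1 + 2 + 2 + 1             # lines 01-08, executed once
    first_probe = compare * A + 1 * A + 1 * (A - S1)     # lines 09-11
    chain_walk = (1 + compare + 1) * (C - 1) + (2 + 1) * (C - 1 - S2)   # lines 12-16
    return hashing + first_probe + chain_walk


# Unsuccessful search (S = 0): should match (2k+5)C + (2k+2)A + (19-2k).
# Successful search (A = S = 1): should match (2k+5)C + 18 + 2*S1.
for k in (1, 2, 10):
    assert scaled_search_time(3, 1, 0, 0, k) == (2*k + 5)*3 + (2*k + 2) + (19 - 2*k)
    assert scaled_search_time(3, 1, 1, 0, k) == (2*k + 5)*3 + 18
```

Setting k = 1 recovers the searching time of formula (2), which is a quick way to confirm that only the comparison cost has been scaled.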

The role of C in formula (4) is less prevalent than in
(A4) as k gets large: the ratio of the coefficients of C and
A decreases from 7/4 in (A4) and approaches the limit
2/2 = 1 in formula (4). Even in this extreme case,
however, computer calculations show that the formula
for the average running time is well-behaved. The values
of β_opt that minimize formula (4) when k is large are
graphed in Fig. 2(c).

For successful searches, however, the value of C more
strongly dominates the running times for larger values of
k, so the limiting values of β_opt in Fig. 2(c) coincide with
the ones that minimize the expected number of probes
per search in Fig. 2(a). Figure 2(b) shows that the
approximation is close even for the case k = 1, which is
Program C.

4.4 How to Choose β

It is important to remember that the address region
size M = ⌈βM'⌉ must be initialized when the hash table
is empty and cannot change thereafter. Unfortunately,
the last two sections show that each different load factor
α requires a different optimum address factor β_opt; in
fact, the values of β_opt differ for unsuccessful and
successful searches. This means that optimizing the average
unsuccessful (or successful) search time for a certain load
factor α will lead to suboptimum performance when the
load factor is not equal to α.



One strategy is to pick β ≈ 0.782, which minimizes
the expected number of probes per unsuccessful search
as well as the average MIX unsuccessful search time when
the table is full (i.e., load factor α = 1), as indicated in
Fig. 2. This choice of β yields the best absolute bound
on search performance, because when the table is full,
search times are greatest and unsuccessful searches
average slightly longer than successful ones. Regardless of
the load factor, the expected number of probes per search
would be at most 1.79, and the average MIX searching
time would be bounded by 33.52u.

Another strategy is to pick some compromise address
factor that leads to good overall performance for a large
range of load factors. A reasonable choice is β = 0.86;
then the unsuccessful searches are optimized (over all
other values of β) when the load factor is ≈0.68 (number
of probes) and ≈0.56 (MIX), and the successful search
performance is optimized at load factors ≈0.94 (number
of probes) and ≈0.95 (MIX).

Figures 3 through 6 graph the expected search
performance of coalesced hashing as a function of α for
both types of searches (unsuccessful and successful) and
for both measures of performance (number of probes
and MIX running time). The C_1 curve corresponds to
standard coalesced hashing (i.e., β = 1); the C_0.86 line is
our compromise choice β = 0.86; and the dashed line
C_opt represents the best possible search performance
that could be achieved by tuning (in which β is optimized
for each load factor).

Notice that the value β = 0.86 yields near-optimum
search times once the table gets half-full, so this
compromise offers a viable strategy. Of course, if some prior
knowledge about the types and frequencies of the
searches were available, we could tailor our choice of β
to meet those specific needs.

5. Comparisons 

In this section, we compare the searching times of the
coalesced hashing method with those from a
representative collection of hashing schemes: standard coalesced
hashing (C_1), separate chaining (S), separate chaining
with ordered chains (SO), linear probing (L), and double
hashing (D). Implementations of the methods are given
in [10].

These methods were chosen because they are the
best known and because they each have implementations
similar to that of Algorithm C. Our comparisons
are based both on the expected number of probes per
search and on the average MIX running time.
Coalesced hashing performs better than the other
methods. The differences are not so dramatic with the
MIX search times as with the number of probes per
search, due to the large overhead in computing the hash
address. However, if the keys were larger and
comparisons took longer, the relative MIX savings would closely
approximate the savings in number of probes.

Fig. 3. The average number of probes per unsuccessful search, as M
and M' → ∞, for coalesced hashing (C_1, C_0.86, C_opt for β = 1, 0.86,
β_opt), separate chaining (S), separate chaining with ordered chains
(SO), linear probing (L), and double hashing (D). (Horizontal axis:
load factor, α or ᾱ.)



5.1 Standard Coalesced Hashing (C_1)

Standard coalesced hashing is the special case of
coalesced hashing for which β = 1 and there is no cellar.
This is obviously the most realistic comparison that can
be made, because except for the initialization of the
address region size, standard coalesced hashing and
"tuned" coalesced hashing are identical. Figures 3 and
4 show that the savings in number of probes per search
can be as much as 14 percent (unsuccessful) and 6
percent (successful). In Figs. 5 and 6, the corresponding
savings in MIX searching time are 6 percent (unsuccessful)
and 2 percent (successful).

Fig. 4. The average number of probes per successful search, as M and
M' → ∞, for coalesced hashing (C_1, C_0.86, C_opt for β = 1, 0.86, β_opt),
separate chaining (S), separate chaining with ordered chains (SO),
linear probing (L), and double hashing (D). (Horizontal axis: load
factor, α or ᾱ.)

Fig. 5. The average MIX execution time per unsuccessful search, as
M' → ∞, for coalesced hashing (C_1, C_0.86, C_opt for β = 1, 0.86, β_opt),
separate chaining (S), separate chaining with ordered chains (SO),
linear probing (L), and double hashing (D).

Fig. 6. The average MIX execution time per successful search, as
M' → ∞, for coalesced hashing (C_1, C_0.86, C_opt for β = 1, 0.86, β_opt),
separate chaining (S), separate chaining with ordered chains (SO),
linear probing (L), and double hashing (D).

5.2 Separate (or Direct) Chaining (S)

The separate chaining method is given an unfair
advantage in Figs. 3 and 4: the number of probes per
search is graphed as a function of ᾱ = N/M rather than
α = N/M' and does not take into account the number of
auxiliary slots used to store colliders. In order to make
the comparison fair, we must adjust the load factor
accordingly.

Separate chaining implementations are often designed
to accommodate about N = M records; an average
of M(1 − 1/M)^M ≈ M/e auxiliary slots are needed to
store the colliders. The total table size is thus M' ≈ M
+ M/e. Solving backwards for M, we get M ≈ 0.731M'.
In other words, we may consider separate chaining to be
the special case of coalesced hashing for which β ≈
0.731, except that no more records can be inserted once
the cellar overflows. Hence, the adjusted load factor is
α ≈ 0.731ᾱ, and overflow occurs when there are around
N = M ≈ 0.731M' inserted records. (This is a reasonable
space/time compromise: if we make M smaller, then
more records can usually be stored before overflow
occurs, but the average search times blow up; if we
increase M to get better search times, then overflow
occurs much sooner, and many slots are wasted.)
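The constant 0.731 follows directly from the collider count. The short sketch below (function name assumed for illustration) evaluates M(1 − 1/M)^M for a finite M, forms the ratio M/M' with M' = M + colliders, and compares it with the limiting value 1/(1 + 1/e).

```python
import math


def separate_chaining_address_factor(m):
    """Return M / M' when N = M records are stored and colliders go to auxiliary slots."""
    colliders = m * (1 - 1 / m) ** m      # expected number of auxiliary slots, about M/e
    return m / (m + colliders)


limit = 1 / (1 + 1 / math.e)
print(round(limit, 3))                                       # 0.731
print(abs(separate_chaining_address_factor(10**6) - limit) < 1e-6)   # True
```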

If we adjust the load factors in Figs. 3 and 4 in this
way, Algorithm C generates better search statistics: the
expected number of probes per search for separate
chaining is ≈1.37 (unsuccessful) and ≈1.5 (successful) when
the load factor ᾱ is 1, while that for coalesced hashing is
≈1.32 (unsuccessful) and ≈1.44 (successful) when the
load factor α = βᾱ is equal to 0.731.

The graphs in Figs. 5 and 6 already reflect this load
factor adjustment. In fact, the MIX implementation of
separate chaining (Program S in [10]) is identical to
Program C, except that β is initialized to 0.731 and
overflow is signaled automatically when the cellar runs
out of empty slots. Program C is slightly quicker in MIX
execution time than Program S, but more importantly,
the coalesced hashing implementation is more space
efficient: Program S usually overflows when α ≈ 0.731,
while Program C can always obtain full storage
utilization α ≈ 1. This confirms our intuition that coalesced
hashing can accommodate more records than the separate
chaining method and still outperform separate chaining
before that method overflows.

5.3 Separate Chaining with Ordered Chains (SO)

This method is a variation of separate chaining in
which the chains are kept ordered by key value. The
expected number of probes per successful search does
not change, but unsuccessful searches are slightly
quicker, because only about half the chain needs to be
searched, on the average.

Our remarks about adjusting the load factor in Figs.
3 and 4 also apply to method SO. But even after that is
done, the average number of probes per unsuccessful
search as well as the expected MIX unsuccessful search
time is slightly better for this method than for coalesced
hashing. However, as Fig. 6 illustrates, the average
successful search time of Program SO is worse than Program
C's, and in real-life situations, the difference is likely to
be more apparent, because records that are inserted first
tend to be looked up more often and should be kept near
the beginning of the chain, not rearranged.

Method SO has the same storage limitations as the
separate chaining scheme (i.e., the table usually
overflows when N ≈ M ≈ 0.731M'), whereas coalesced
hashing can obtain full storage utilization.

5.4 Linear Probing (L) and Double Hashing (D) 

When searching for a record with key K, the linear
probing method first checks location hash(K), and if
another record is already there, it steps cyclically through
the table, starting at location hash(K), until the record is
found (successful search) or an empty slot is reached
(unsuccessful search). Insertions are done by placing the
record into the empty slot that terminated the
unsuccessful search. Double hashing generalizes this by letting
the cyclic step size be a function of K.
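The two open-addressing methods differ only in the step size, which a short sketch makes concrete. Everything named here is an assumption for illustration: slots are numbered from 0, None marks an empty slot, the hash address is k mod (table size), probing steps upward, and the double-hashing step function 1 + k mod (size − 1) presumes a prime table size so that every slot is eventually visited.

```python
def probe_sequence(home, step, size):
    """Yield the slots visited cyclically, starting at the hash address."""
    for n in range(size):
        yield (home + n * step) % size


def search_or_insert(table, k, step_fn=None):
    """Return (slot, found). Linear probing if step_fn is None, else double hashing."""
    size = len(table)
    step = 1 if step_fn is None else step_fn(k, size)
    for slot in probe_sequence(k % size, step, size):
        if table[slot] is None:          # empty slot ends an unsuccessful search
            table[slot] = k              # ...and receives the new record
            return slot, False
        if table[slot] == k:             # successful search
            return slot, True
    raise OverflowError("table is full")


linear = [None] * 7
print([search_or_insert(linear, k)[0] for k in (3, 10, 17)])          # [3, 4, 5]

double = [None] * 7
second_hash = lambda k, size: 1 + k % (size - 1)
print([search_or_insert(double, k, second_hash)[0] for k in (3, 10, 17)])   # [3, 1, 2]
```

The three keys collide at slot 3 in both tables; linear probing stores them in adjacent slots, while the key-dependent step of double hashing scatters them.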

We have to adjust the load factor in the opposite 
direction when we compare Algorithm C with methods 
L and D, because the latter do not require LINK fields. 
For example, if we suppose that the LINK field com- 
prises i of the total record size in a coalesced hashing 
implementation, then the search statistics in Figs. 3 and 
4 for Algorithm C with load factor a should be compared 
against those for linear probing and double hashing with 
load factor (J)a. In this case, the average number of 
probes per search is still better for coalesced hashing. 

However, the LINK field is often much smaller than
the rest of the record, and sometimes it can be included
in the table at virtually no extra cost. The MIX
implementation Program C in [10] assumes that the LINK field
can be squeezed into the record without need of extra
storage space. Figures 5 and 6, therefore, require no load
factor adjustment.

To balance matters, the MIX implementations of
linear probing and double hashing, which are given in [10]
and [7], contain two code optimizations. First, since
LINK fields are not used in methods L and D, we no
longer need 0 to denote a null LINK, and we can
renumber the table slots from 0 to M' − 1; the hash
function now returns a value between 0 and M' − 1.
This makes the hash address computation faster by 1u,
because the instruction INC I, 1 can be eliminated.
Second, the empty slots are denoted by the value 0 in
order to make the comparisons in the inner loop as fast
as possible. This means that records are not allowed to
have a key value of 0. The final results are graphed in
Figs. 5 and 6. Coalesced hashing clearly dominates when
the load factor is greater than 0.6.



6. Deletions 

It is often useful in hashing applications to be able to
delete records when they no longer logically belong to
the set of objects being represented in the hash table. For
example, in an airline reservation system, passenger
records are often expunged soon after the flight has
taken place.

One possible deletion strategy often used for linear 
probing and double hashing is to include a special one- 
bit DELETED field in each record that says whether or 
not the record has been deleted. The search algorithm 
must be modified to treat each "deleted" table slot as if 
it were occupied by a null record, even though the entire 
record is still there. This is especially desirable when 
there are pointers to the records from outside the table. 

If there are no such external pointers to worry about, 
the "deleted" table slots can be reused for later insertions: 
Whenever an empty slot is needed in step C5 of Algo- 
rithm C, the record is inserted into the first "deleted" 
slot encountered during the unsuccessful search; if there 
is no such slot, an empty slot is allocated in the usual 
way. However, a certain percentage of the "deleted" slots 
probably will remain unused, thus preventing full storage 
utilization. Also, insertions and deletions over a pro- 
longed period would cause the expected search times to 
approximate those for a full table, regardless of the 
number of undeleted records, because the "deleted" 
records make the searches longer. 

If we are willing to spend a little extra time per 
deletion, we can do without the DELETED field by 
relocating some of the records that follow in the chain. 
The basic idea is this: First, we find the record we want 
to delete, mark its table slot empty, and set the LINK 
field of its predecessor (if any) to the null value 0. Then 
we use Algorithm C to reinsert each record in the re- 
mainder of the chain, but whenever an empty slot is 
needed in step C5, we use the position that the record 
already occupies. 

This method can be illustrated by deleting al from
location 10 in Fig. 7(a); the end result is pictured in Fig.
7(b). The first step is to create a hole in position 10 where
al was, and to set audrey's LINK field to 0. Then we
process the remainder of the chain. The next record
tootie rehashes to the hole in location 10, so tootie
moves up to plug the hole, leaving a new hole in position
9. Next, donna collides with audrey during rehashing,
so donna remains in slot 8 and is linked to audrey.
Then mark also collides with audrey; we leave mark in
position 7 and link it to donna, which was formerly at
the end of audrey's hash chain. The record jeff rehashes
to the hole in slot 9, so we move it up to plug the hole,
and a new hole appears in position 6. Finally, dave
rehashes to position 9 and joins jeff's chain.

Location 6 is the current hole position when the
deletion algorithm terminates, so we set EMPTY[6] ←
true and return it to the pool of empty slots. However,
the value of R in Algorithm C is already 5, so step C5
will never try to reuse location 6 when an empty slot is
needed.

We can solve this problem by using an available-space
list in step C5 rather than the variable R; the list
must be doubly linked so that a slot can be removed
quickly from the list in step C6. The available-space list
does not require any extra space per table slot, since we
can use the KEY and LINK fields of the empty slots for
the two pointer fields. (The KEY field is much larger
than the LINK field in typical implementations.) For
clarity, we rename the two pointer fields NEXT and
PREV. Slot 0 in the table acts as the dummy start of the
available-space list, so NEXT[0] points to the first actual
slot in the list and PREV[0] points to the last. Before
any records are inserted into the table, the following
extra initializations must be made: NEXT[0] ← M';
PREV[M'] ← 0; and NEXT[i] ← i − 1 and PREV[i −
1] ← i, for 1 ≤ i ≤ M'. We replace steps C5 and C6 by

C5. [Find empty slot.] (The search for K in the chain
was unsuccessful, so we will try to find an empty
table slot to store K.) If the table is already full (i.e.,
NEXT[0] = 0), the algorithm terminates with
overflow. Otherwise, set LINK[i] ← NEXT[0] and i ←
NEXT[0].

C6. [Insert new record.] Remove the ith slot from the
available-space list by setting PREV[NEXT[i]] ←
PREV[i] and NEXT[PREV[i]] ← NEXT[i]. Then
set EMPTY[i] ← false, KEY[i] ← K, LINK[i] ←
0, and initialize the other fields in the record.

Fig. 7. (a) Inserting the eight records; (b) Inserting all the records except al.

The following deletion algorithm is analyzed in 
detail in [10] and [14]. 

Algorithm CD {Deletion with coalesced hashing). This 
algorithm deletes the record with key K from a coalesced 
hash table constructed by Algorithm C, with steps C5 
and C6 modified as above. 

This algorithm preserves the important invariant that 
K is stored at its hash address if and only if it is at the 
start of its chain. This makes searching for K 's predeces- 
sor in the chain easy: if it exists, then it must come at or 
after position hash{K) in the chain. 

CD1. [Search for K.] Set / «- hash(K). If EMPTY[i], 
then AT is not present in the table and the algorithm 
terminates. Otherwise, if K = KEY\i\ then K is at 
the start of the chain, so go to step CD3. 

CD2. [Split chain in two.] (K is not at the start of its 
chain.) Repeatedly set PRED «— i and / <— 
LINK[i] until either i = 0 or K « KEY\i\ If / = 
0, then K is not present in the table, and the 
algorithm terminates. Else, set LJNK[PRED] <- 
0. 

CD3. [Process remainder of chain.] (Variable / will walk 
through the successors of K in the chain.) Set 
HOLE *- /, / <- LJNK\Jl LINK[HOLE] <- 0. 
Do step CD4 zero or more times until i = 0. Then 
go to step CDS. 

CD4. [Rehash record in Ah slot.] Set j <- hash(KEY[i]). 
If j = HOLE, we move up the record to plug the 
hole by setting KEY[HOLE] <- KEY[i] and 
HOLE <— /. Otherwise, we link the record to the 
end of its hash chain by doing the following: set 
j *— LINK[j] zero or more times until LINK[j] 
= 0; then set LINK[j] <- L Set k +- LJNK[i] y 
LINK[i] *- 0, and i *- k. Repeat step CD4 unless 
r = 0. 

CD5. [Mark slot HOLE empty.] Set EMPTY[HOLE] <-
true. Place HOLE at the start of the available-
space list by setting NEXT[HOLE] <- NEXT[0],
PREV[HOLE] <- 0, PREV[NEXT[0]] <- HOLE,
NEXT[0] <- HOLE. ■
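Steps CD1 through CD5 can be sketched in Python as follows. This is a minimal illustration rather than the paper's own code: it assumes standard coalesced hashing with no cellar, uses slot 0 as the header of the available-space list, and takes new slots from the end of that list. The class name CoalescedTable and the hash_fn parameter are hypothetical.

```python
class CoalescedTable:
    """Standard coalesced hashing (no cellar); slots 1..m, slot 0 = list header."""

    def __init__(self, m, hash_fn):
        self.hash = hash_fn                    # assumed to map a key to 1..m
        self.key = [None] * (m + 1)
        self.link = [0] * (m + 1)              # 0 marks the end of a chain
        self.empty = [False] + [True] * m
        # Circular doubly linked available-space list 0 -> 1 -> ... -> m -> 0.
        self.next = [(i + 1) % (m + 1) for i in range(m + 1)]
        self.prev = [(i - 1) % (m + 1) for i in range(m + 1)]

    def search(self, k):
        """Return the slot holding k, or 0 if k is absent."""
        i = self.hash(k)
        if self.empty[i]:
            return 0
        while i != 0 and self.key[i] != k:
            i = self.link[i]
        return i

    def insert(self, k):
        i = self.hash(k)
        if not self.empty[i]:
            while self.key[i] != k and self.link[i] != 0:
                i = self.link[i]
            if self.key[i] == k:
                return i                       # already present
            r = self.prev[0]                   # assumption: allocate from list end
            if r == 0:
                raise OverflowError("table is full")
            self.link[i] = r
            i = r
        # Step C6: unlink slot i from the available-space list and fill it.
        self.prev[self.next[i]] = self.prev[i]
        self.next[self.prev[i]] = self.next[i]
        self.empty[i], self.key[i], self.link[i] = False, k, 0
        return i

    def delete(self, k):
        i = self.hash(k)                       # CD1
        if self.empty[i]:
            return False
        if self.key[i] != k:                   # CD2: find k, cut the chain before it
            pred, i = i, self.link[i]
            while i != 0 and self.key[i] != k:
                pred, i = i, self.link[i]
            if i == 0:
                return False
            self.link[pred] = 0
        hole, i = i, self.link[i]              # CD3
        self.link[hole] = 0
        while i != 0:                          # CD4: rehash each successor of k
            j = self.hash(self.key[i])
            if j == hole:
                self.key[hole] = self.key[i]   # move the record up into the hole
                hole = i
            else:
                while self.link[j] != 0:       # append to the end of its hash chain
                    j = self.link[j]
                self.link[j] = i
            nxt = self.link[i]
            self.link[i] = 0
            i = nxt
        self.empty[hole], self.key[hole] = True, None   # CD5
        self.next[hole] = self.next[0]
        self.prev[hole] = 0
        self.prev[self.next[0]] = hole
        self.next[0] = hole
        return True
```

After a deletion, every remaining key should still be reachable from its hash address, and the freed slot should sit at the front of the available-space list.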

Algorithm CD has the important property that it 
preserves randomness for the special case of standard 
coalesced hashing (when M = M'), in that deleting a
record is in some sense like never having inserted it. The 
"sense" is strong enough so that the formulas for the 
average search times are still valid after deletions are 
performed. Exactly what preserving randomness means 
is explained in detail in [14]. 

We can speed up the rehashing phase in the latter 
half of step CD4 by linking the record into the chain 
immediately after its hash address rather than at the end 
of the chain. When this modified deletion algorithm is 
called on a random standard coalesced hash table, the 



resulting table is better-than-random: the average search 
times after N random insertions and one deletion are 
sometimes better (and never worse) than they would be 
with N - 1 random insertions alone. Whether or not this
remains true after more than one deletion is an open 
problem. 

If this deletion algorithm is used when there is a 
cellar (i.e., β < 1), we can modify it so that whenever a
hole appears in the cellar during the execution of Algo- 
rithm CD, the next noncellar record in the chain moves 
up to plug the hole. Unfortunately, even with this mod- 
ification, the algorithm does not break up chains well 
enough to preserve randomness. It seems possible that 
search performance may remain very good anyway. 
Analytic and empirical study is needed to determine just 
"how far from random" the search times get after dele- 
tions are performed. 

Two remarks should be made about implementing
this modified deletion algorithm. First, in step CD5, the
empty slot should be returned to the start of the
available-space list when the slot is in the cellar; otherwise,
it should be placed at the end. This has the effect of giving
cellar slots higher priority on the available-space list.
Second, if a cellar slot is freed by a deletion and then
reallocated during a later insertion, it is possible for a
chain to go in and out of the cellar more than once.
Programmers should no longer assume that a chain's cellar
slots immediately follow the start of the chain.

7. Implementations and Variations 

Most important searching algorithms have several 
different implementations in order to handle a variety of 
applications; coalesced hashing is no exception. We have 
already discussed some modifications in the last section 
in connection with deletion algorithms. In particular, we 
needed to use a doubly linked available-space list so that 
the empty slots could be added and removed quickly. 
Thus, the cellar need not be contiguous. Another strategy 
to handle a noncontiguous cellar is to link all the table 
slots together initially and to replace "Decrease R" in 
step C5 of Algorithm C with "Set R <- LINK[R]." With
either modification, Algorithm C can simulate the sepa- 
rate chaining method until the cellar empties; subsequent 
colliders can be stored in the address region as usual. 
Hence, coalesced hashing can have the benefit of dy- 
namic allocation as well as total storage utilization. 

Another common data structure is to store pointers 
to the fields, rather than the fields themselves, in the 
table slots. For example, if the records are large, we 
might want to store only the key and link values in each 
slot, along with a pointer to where the rest of the record 
is located. We expand upon this idea later in this section. 

If we are willing to do extra work during insertion 
and if the records are not pointed to from outside the 
table, we can modify the insertion algorithm to prevent 
the chains from coalescing: When a record R1 collides
during insertion with another record R2 that is not at the
start of the chain, we store R1 at its hash address and
relocate R2 to some other spot. (The LINK field of R2's
predecessor must be updated.) The size of the records
should not be very large or else the cost of rearrangement 
might get prohibitive. There is an alternate strategy that 
prevents coalescing and does not relocate records, but it 
requires an extra link field per slot and the searches are 
slightly longer. One link field is used to chain together 
all the records with the same hash address. The other 
link field contains for slot i a pointer to the start of the 
chain of records with hash address i. Much of the space
for the link fields is wasted, and chains may start one 
link away from their hash address. Resources could be 
put to better use by using coalesced hashing. 

This section is devoted to the more nonobvious im- 
plementations of coalesced hashing. First, we describe 
alternate insertion strategies and then conclude with 
three applications to external searching on secondary 
storage devices. A scheme that allows the coalesced hash 
table to share memory with other data structures can be 
found in [12]. A generalization of coalesced hashing that 
uses nonuniform hash functions is described in [13]. 

7.1 Early-Insertion and Varied-Insertion Coalesced 
Hashing 

If we know a priori that a record is not already 
present in the table, then it is not necessary in Algorithm 
C to search to the end of the chain before the record is 
inserted: If the hash address location is empty, the record 
can be inserted there; otherwise, we can link the record 
into the chain immediately after its hash address by 
rerouting pointers. We call this the early-insertion method 
because the collider is linked "early" in the chain, rather 
than at the end. We will refer to the unmodified algorithm
(Algorithm C in Sec. 2) as the late-insertion
method.

Early-insertion can be used even if we do not have a 
priori knowledge about the record's presence, in which 
case the entire chain must be searched in order to verify 
that the record is not already stored in the table. We can 
implement this form of early-insertion by making the 
following two modifications to Algorithm C. First, we 
add the assignment "Set j <- i" at the end of step C2, so
that j stores the hash address hash(K). The second
modification replaces the last sentence of step C5 by
"Otherwise, link the Rth cell into the chain immediately
after the hash address j by setting LINK[R] <- LINK[j],
LINK[j] <- R; then set i <- R."
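The two modifications can be sketched in Python as below. This is a hedged illustration, not the paper's code: it assumes no cellar and no deletions, finds an empty slot by scanning downward from the top of the table, and the names insert, probes, and the early flag are hypothetical.

```python
def insert(key, link, k, h, early=True):
    """Insert k into a cellar-less coalesced table and return its slot.

    key, link: lists indexed 1..M (index 0 unused; link 0 = end of chain).
    h: hash function mapping a key to 1..M.
    early=True links a collider right after its hash address;
    early=False links it at the end of the chain (late-insertion).
    """
    i = j = h(k)                  # j remembers the hash address (modified step C2)
    if key[i] is not None:
        while True:               # search the whole chain to rule out a duplicate
            if key[i] == k:
                return i
            if link[i] == 0:
                break
            i = link[i]
        r = len(key) - 1          # assumption: take the highest-numbered empty slot
        while r > 0 and key[r] is not None:
            r -= 1
        if r == 0:
            raise OverflowError("table is full")
        if early:
            link[r] = link[j]     # modified step C5
            link[j] = r
        else:
            link[i] = r           # unmodified step C5
        i = r
    key[i] = k
    return i


def probes(key, link, k, h):
    """Number of probes in a successful search for k."""
    i, n = h(k), 1
    while key[i] != k:
        i, n = link[i], n + 1
    return n


if __name__ == "__main__":
    h = lambda k: k % 11 + 1      # hypothetical hash function for the demo
    for early in (False, True):
        key, link = [None] * 12, [0] * 12
        for k in (0, 11, 22, 10):
            insert(key, link, k, h, early)
        total = sum(probes(key, link, k, h) for k in (0, 11, 22, 10))
        print("early" if early else "late", total)
```

In this small made-up example the key 10 hashes to a slot in the middle of a chain, which is the situation where linking it early shortens its search.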

Each chain of records formed using early-insertion 
contains the same records as the corresponding chain 
formed by late-insertion. Since the length of a random 
unsuccessful search depends only on the number of 
records in the chain between the hash address and the 
end of the chain, and since all the records are in the 
address region when there is no cellar, it must be true 
that the average number of probes per unsuccessful 
search is the same for the two methods if there is no 
cellar. However, the order of the records within each 
chain may be different for early-insertion than for late- 
insertion. When there is no cellar, the early-insertion 
algorithm causes the records to align themselves in the 
chains closer to their hash addresses, on the average, 
than would be the case with late-insertion, so the ex- 
pected successful search times are better. 

A typical case is illustrated in Fig. 8. The record dave
collides with a.l. at slot 5. In Fig. 8(a), which uses
late-insertion, dave is linked to the end of the chain
containing a.l., whereas if we use early-insertion as in
Fig. 8(b), dave is linked into the chain at the point between
a.l. and al. The average successful search time in Fig. 8(b)
is slightly better than in Fig. 8(a), because linking dave
into the chain immediately after a.l. (rather than at the
end of the chain) reduces the search time for dave from
four probes to two and increases the search time for al
from two probes to three. The result is a net decrease of
one probe.

Fig. 8. Standard Coalesced Hashing, M = M' = 11, N = 8. (a) Late-insertion; (b) Early-insertion.

One can show easily that this effect manifests itself
only on chains of length greater than 3, so there is little
improvement when the load factor α is small, since the
chains are usually short. Recent theoretical results show
that the average number of probes per successful search
is 5 percent better with early-insertion than with
late-insertion when there is no cellar and the table is full
(i.e., α = 1), but is only 0.5 percent better when α = 0.5
[1, 5]. A possible disadvantage of early-insertion is that
earlier colliders tend to be shoved to the rear by later 
ones, which may not be desirable in some practical 
situations when the records inserted first tend to be 
accessed more often than those inserted later. Neverthe- 
less, early-insertion is an improvement over late-insertion 
when there is no cellar. 

When there is a cellar, preliminary studies indicate 
that search performance is probably worse with early- 
insertion than with Algorithm C, because a chain's rec- 
ords that are in the cellar now come at the end of the 
chain, whereas with late-insertion they come immedi- 
ately after the start. In the example in Fig. 9(b), the 
insertion of jeff causes both cellar records al and tootie 
to move one link further from their hash addresses. That 
does not happen with late-insertion in Fig. 9(a).

We shall now introduce a new variant, called varied- 
insertion, that can be shown to be better than both the 
late-insertion and early-insertion methods when there is 
a cellar. When there is no cellar, varied-insertion is
identical to early-insertion. In the varied-insertion
method, the early-insertion strategy is used except when 
the cellar is full and the hash address of the inserted 
record is the start of a chain that has records in the 
cellar. In that case, the record is linked into the chain 
immediately after the last cellar slot in the chain. 
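The rule just stated can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's code: there are no deletions, empty slots are handed out from the highest-numbered slot downward (so the cellar fills first), and the function name varied_insert and its parameters are hypothetical.

```python
def varied_insert(key, link, m, k, h):
    """Insert k using varied-insertion and return the slot used.

    key, link: lists indexed 1..M' (index 0 unused; link 0 = end of chain).
    m: address-region size M; slots m+1..M' form the cellar.
    h: hash function mapping a key to 1..m.
    """
    j = i = h(k)
    if key[i] is None:            # hash address is empty: store the record there
        key[i] = k
        return i
    while True:                   # search the whole chain for k
        if key[i] == k:
            return i
        if link[i] == 0:
            break
        i = link[i]
    r = len(key) - 1              # highest-numbered empty slot
    while r > 0 and key[r] is not None:
        r -= 1
    if r == 0:
        raise OverflowError("table is full")
    p = j                         # early-insertion point: the hash address
    if r <= m:                    # no empty cellar slot is left
        while link[p] > m:        # skip the chain's cellar records
            p = link[p]
    key[r], link[r] = k, link[p]  # link slot r into the chain right after p
    link[p] = r
    return r
```

When the hash address lies in the middle of a chain, its successor is never a cellar slot under these assumptions, so the skip loop does nothing and the function behaves like early-insertion.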

Figure 9(c) shows a typical hash table constructed 
using varied-insertion. The cellar is already full when 
the record dave is inserted. The hash address of dave is 
1, which is at the start of a chain that has records in the 
cellar. Therefore, early-insertion is not used, and dave 
is instead linked into the chain immediately after al, 
which is the last record in the chain that is in the cellar. 
The average number of probes per search is better for 
varied-insertion than for both late-insertion and early- 
insertion. 

The varied-insertion method incorporates the advan- 
tages of early-insertion, but without any of the drawbacks 
described three paragraphs earlier. The records of a 
chain that are in the cellar always come immediately 
after the start of the chain. The average number of 
probes per search for varied-insertion is always less than 
or equal to that for late-insertion and early-insertion. 
For unsuccessful searches, the expected number of 
probes for varied-insertion and late-insertion are identi- 
cal. 

Research is currently underway to determine the 
average search times for the varied-insertion method, as 
well as to find the values of the optimum address factor 
β_opt. We expect that the initialization β <- 0.86 will be
preferred in most situations, as it is for late-insertion. 
The resulting search times for varied-insertion should be 
a slight improvement over late-insertion. 

The idea of linking the inserted record into the chain
immediately after its hash address has been incorporated
into the first modification of Algorithm CD in the last
section. It is natural to ask whether the modified deletion
algorithm would preserve randomness for the modified
insertion algorithms presented in this section. The answer
is no, but it is possible that the deletion algorithm could
make the table better-than-random, as discussed at the
end of the last section. Finding good deletion algorithms
for early-insertion and varied-insertion as well as for
late-insertion is a challenging problem.

Fig. 9. Coalesced Hashing, M'

7.2 Coalesced Hashing with Buckets 

Hashing is used extensively in database applications 
and file systems, where the hash table is too large to fit 
entirely in main memory and must be stored on external 
devices, like disks and drums. The hash table is sectioned 
off into blocks (or pages), each block containing b rec- 
ords; transfers to and from main memory take place a 
block at a time. Searching time is dominated by the 
block transfer rate; now the object is to minimize the 
expected number of block accesses per search. 

Operating systems with a virtual memory environ- 
ment are designed to break up data structures into blocks 
automatically, even though it appears to the programmer 
that his data structures all reside in main memory. Linear 
probing (see Sec. 5) is often the best hashing scheme to
use in this environment, because successive probes occur 
in contiguous locations and are apt to be in the same 
block. Thus, one or two block accesses are usually suf- 
ficient for lookup. 

We can do better if we know beforehand where the 
block divisions occur. We treat each block as a large 
table slot or bucket that can store b records. Let M' be
the total number of buckets. The following modification
of Algorithm C appears in [7].

To process a record with key K, we search for it in 
the chain of buckets, starting at bucket hash(K). After 
an unsuccessful search, we insert the record into the last 
bucket in the chain if there is room, or else we store it in 
some nonfull bucket and link that bucket to the end of 
the chain. We can speed up this last part by maintaining 
a doubly linked circular list of nonfull buckets, with a 
"roving pointer" marking one of the buckets. Each time 
we need another nonfull bucket to store a collider, we 
insert the record into the bucket indicated by the roving 
pointer, and then we reset the roving pointer to the next 
bucket on the list. This helps distribute the records 
evenly, because different chains will use different buckets 
(at least until we make one loop through the available- 
bucket list). When the external device is a disk, block 
accesses are faster when they occur on the same cylinder, 
so we should keep a separate available-bucket list for 
each cylinder. 
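The bucket scheme can be sketched in Python as below. This is a simplified illustration, not the method of [7] verbatim: a collections.deque stands in for the doubly linked circular list of nonfull buckets (its front plays the role of the roving pointer), deletions are not handled, and the class name BucketTable is hypothetical.

```python
from collections import deque


class BucketTable:
    """Coalesced hashing with buckets of capacity b; buckets are numbered 1..n."""

    def __init__(self, n, b, hash_fn):
        self.b = b
        self.hash = hash_fn                        # assumed to map a key to 1..n
        self.bucket = [[] for _ in range(n + 1)]   # index 0 unused
        self.link = [0] * (n + 1)                  # 0 marks the end of a chain
        self.nonfull = deque(range(1, n + 1))      # front = roving pointer

    def search(self, k):
        """Return (bucket holding k or None, block accesses, last bucket visited)."""
        i, accesses = self.hash(k), 1
        while k not in self.bucket[i]:
            if self.link[i] == 0:
                return None, accesses, i
            i, accesses = self.link[i], accesses + 1
        return i, accesses, i

    def insert(self, k):
        found, _, last = self.search(k)
        if found is not None:
            return found
        if len(self.bucket[last]) == self.b:       # last bucket in the chain is full
            if not self.nonfull:
                raise OverflowError("all buckets are full")
            new = self.nonfull[0]                  # bucket under the roving pointer
            self.nonfull.rotate(-1)                # advance the roving pointer
            self.link[last] = new                  # link it to the end of the chain
            last = new
        self.bucket[last].append(k)
        if len(self.bucket[last]) == self.b:
            self.nonfull.remove(last)              # O(n) here; O(1) with a real linked list
        return last
```

Without deletions a nonfull bucket always has a zero link, so linking it to the end of another chain cannot create a cycle.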

Record size varies from application to application, 
but for purposes of illustration, we use the following 
parameters: the block size B is 4000 bytes; the total 
record size R is 400 bytes, of which the key comprises 7 
bytes. The bucket size b is approximately B/R = 10. 
When the size of the bucket is that small, searching in 
each bucket can be done sequentially; there is no need 
for the record size to be fixed, as long as each record is 
preceded by its length (in bytes). 



Deletions can be done in one of several ways, anal- 
ogous to the different methods discussed in the last 
section. In some cases, it is best merely to mark the 
record as "deleted," because there may be pointers to the 
record from somewhere outside the hash table, and 
reusing the space could cause problems. Besides, many 
large scale database systems undergo periodic reorgani- 
zation during low-peak hours, in which the entire table 
(minus the deleted records) is reconstructed from scratch 
[15]. This method has not been analyzed analytically, 
but it seems to have great potential. 

7.3 Hash Tables Within a Hash Table

When the record size R is small compared to the 
block size B, the resulting bucket size b ≈ B/R is
relatively large. Sequential search through the blocks is 
now too slow. (The block transfer rate no longer domi- 
nates search times.) Other methods should be used to 
organize the records within blocks. 

This is especially true with multiattribute indexing, in 
which we can look up records via one of several different 
keys. For example, a large university database may allow 
a student's record to be accessed by specifying either his 
name, social security number, student I.D., or bank 
account number. In this case, four hash tables are used. 
Instead of storing all the records in four different tables, 
we let the four tables share a single copy of the records. 
Each hash table entry consists of only the key value, the 
link field, and a pointer to the rest of the student record 
(which is stored in some other block). Lookup now 
requires one extra block access. Continuing our numer- 
ical example, the table record size reduces from R = 400 
bytes to about R = 12 bytes, since the key occupies 
7 bytes, and the two pointer fields presumably can be 
squeezed into the remaining 5 bytes. The bucket size b 
is now about B/R ≈ 333.

In such cases where b is rather large and searching 
within a bucket can get expensive, it pays to organize 
each bucket as a hash table. The hash function must be 
modified to return a binary number at least ⌈log M'⌉ +
⌈log b⌉ bits in length; the high-order bits of the hash
address specify one of the M' buckets (or blocks), and 
the low-order bits specify one of the b record positions 
within that bucket. Coalesced hashing is a natural 
method to use because the bucket size (in this example,
b ≈ 333) imposes a definite constraint on the number of
records that may be stored in a block, so it is reasonable 
to try to optimize the amount of space devoted to the 
address region versus the amount of space devoted to the 
cellar. 
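One way to split such a hash value is sketched below in Python. It is an illustration only: the function name split_hash is hypothetical, buckets and positions are numbered from 0 here, logarithms are taken base 2, and the final reduction modulo M' and b (needed when they are not powers of two) is an assumption rather than something the text specifies.

```python
def split_hash(h, num_buckets, b):
    """Split hash value h into (bucket, position).

    The low-order ceil(log2 b) bits select a record position within the
    bucket; the next ceil(log2 num_buckets) bits select the bucket.
    """
    low_bits = (b - 1).bit_length()              # ceil(log2 b) for b >= 1
    high_bits = (num_buckets - 1).bit_length()   # ceil(log2 num_buckets)
    h &= (1 << (high_bits + low_bits)) - 1       # keep only the bits we need
    bucket = (h >> low_bits) % num_buckets
    position = (h & ((1 << low_bits) - 1)) % b
    return bucket, position
```

With b = 333, the low-order field is 9 bits wide, since 2^8 < 333 <= 2^9.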

7.4 Dynamic Hashing 

So far we have not addressed the problem of what to 
do when overflow occurs — when we want to insert more 
records into a hash table that is already full. The common 
technique is to place the extra records into an auxiliary 
storage pool and link them to the main table. Search 
performance remains tolerable as long as the number of 
insertions after overflow does not get too large. (Guibas 
[4] analyzes this for the special case of standard coalesced
hashing.) Later during the off-hours when the system is
not heavily used, a larger table is allocated and the 
records are reinserted into the new table. 

This strategy is not viable when database utilization 
is relatively constant with time. Several similar methods, 
known loosely as dynamic hashing, have been devised 
that allow the table size to grow and shrink dynamically 
with little overhead [3, 8, 9]. When the load factor gets 
too high or when buckets overflow, the hash table grows 
larger and certain buckets are split, thereby reducing the 
congestion. If the bucket size is rather large, for example, 
if we allow multiattribute accessing, then coalesced hash- 
ing can be used to organize the records within a block, 
as explained above, thus combining this technique with 
coalesced hashing in a truly dynamic way. 



8. Conclusions 

Coalesced hashing is a conceptually elegant and ex- 
tremely fast method for information storage and re- 
trieval. This paper has examined in detail several prac- 
tical issues concerning the implementation of the 
method. The analysis and programming techniques pre- 
sented here should allow the reader to determine whether 
coalesced hashing is the method of choice in any given 
situation, and if so, to implement an efficient version of 
the algorithm. 

The most important issue addressed in this paper is 
the initialization of the address factor β. The intricate
optimization process discussed in Sec. 4 and the Appen- 
dix can in principle be applied to any implementation of 
coalesced hashing. Fortunately, there is no need to un- 
dertake such a computational burden for each applica- 
tion, because the results presented in this paper apply to 
most reasonable implementations. The initialization β
≈ 0.86 is recommended in most cases, because it gives
near-optimum search performance for a wide range of 
load factors. The graph in Fig. 2 makes it possible to 
fine-tune the choice of β, in case some prior knowledge
about the types and frequencies of the searches is avail- 
able. 

The comparisons in Sec. 5 show that the tuned 
coalesced hashing algorithm outperforms several popular 
hashing methods when the load factor is greater than 0.6.
The differences are more pronounced for large records. 
The inner search loop in Algorithm C is very short and 
simple, which is important for practical implementations. 
Coalesced hashing has the advantage over other chaining 
methods that it uses only one link field per slot and can 
achieve full storage utilization. The method is especially 
suited for applications with a constrained amount of 
memory or with the requirement that the records cannot 
be relocated after they are inserted. 

In applications where deletions are necessary, one of 
the strategies described in Sec. 6 should work well in 
practice. However, research remains to be done in several 
areas including the analysis of the current deletion
algorithms and the design of new strategies that hopefully
will preserve randomness. The variant methods in Sec.
7 also pose interesting theoretical and practical open
problems. The search performance of varied-insertion
coalesced hashing is slightly better than Algorithm C;
research is currently underway to analyze its performance
and to determine the optimum setting β_opt. One exciting
aspect of coalesced hashing is that it is an extremely
good technique which very likely can be made even
more applicable when these open questions are solved.



Appendix 

For purposes of average-case analysis, we assume 
that an unsuccessful search can begin at any of the M 
address region slots with equal probability. This includes 
the special case of insertion. Similarly, each record in the 
table has the same chance of being the object of any 
given successful search. In other words, all searches and 
insertions involve random keys. This is sometimes called 
the Bernoulli probability model.

The asymptotic formulas in this section apply to a
random M'-slot coalesced hash table with address region
size M = ⌈βM'⌉ and with N = ⌈αM'⌉ inserted records,
where the load factor α and the address factor β are
constants in the ranges 0 ≤ α ≤ 1 and 0 < β ≤ 1. Formal
derivations are given in [10, 11, 13].

Number of Probes Per Search 

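The expected number of probes in unsuccessful and
successful searches, respectively, as M' → ∞ is

$$
C'_N(M', M) \sim
\begin{cases}
e^{-\alpha/\beta} + \dfrac{\alpha}{\beta}
  & \text{if } \alpha \le \lambda\beta \\[8pt]
\dfrac{1}{\beta}
  + \dfrac{1}{4}\left(e^{2(\alpha/\beta - \lambda)} - 1\right)
    \left(3 - \dfrac{2}{\beta} + 2\lambda\right)
  - \dfrac{1}{2}\left(\dfrac{\alpha}{\beta} - \lambda\right)
  & \text{if } \alpha \ge \lambda\beta
\end{cases}
\tag{A1}
$$

$$
C_N(M', M) \sim
\begin{cases}
1 + \dfrac{\alpha}{2\beta}
  & \text{if } \alpha \le \lambda\beta \\[8pt]
1 + \dfrac{\beta}{8\alpha}
    \left(e^{2(\alpha/\beta - \lambda)} - 1
          - 2\left(\dfrac{\alpha}{\beta} - \lambda\right)\right)
    \left(3 - \dfrac{2}{\beta} + 2\lambda\right)
  + \dfrac{1}{4}\left(\dfrac{\alpha}{\beta} + \lambda\right)
  + \dfrac{\lambda}{4}\left(1 - \dfrac{\lambda\beta}{\alpha}\right)
  & \text{if } \alpha \ge \lambda\beta
\end{cases}
\tag{A2}
$$

where λ is the unique nonnegative solution to the equation

$$
e^{-\lambda} + \lambda = \frac{1}{\beta}
\tag{A3}
$$

924 Communications of the ACM, December 1982, Volume 25, Number 12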



For each address region size M, the average number
of probes per search is maximized when the table is full.
Figure 10 graphs these maximum values C'_{M'}(M', M)
and C_{M'}(M', M) as a function of the address factor β.
The choice β ≈ 0.782 yields the best bound on the
number of unsuccessful search probes, namely, C'_{M'}(M',
M) ≈ 1.79. If we set β ≈ 0.853, we get the corresponding
successful search bound, which is C_{M'}(M', M) ≈ 1.69.

Although formulas (A1) and (A2) appear a bit
formidable at first, they carry a good deal of intuitive
meaning. The variable λ is defined to be L/M, where L
is the average number of inserted records when the cellar
first gets full. The condition α ≤ λβ means that with high
probability the cellar is not yet full. In this case, the
chains have not coalesced, so the formulas are identical
to those for separate chaining, which are given in [7]. In
the other situation, when α ≥ λβ and the cellar is full
with high probability, the formulas are structurally
similar to those for the standard coalesced hashing method,
given in [7]. At the crossover point α = λβ, both cases of
each formula are equal, as can be seen by applying (A3).
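The crossover claim can be checked directly. The following worked step is a sketch that assumes the usual forms of (A1) and (A2), restated inline, and writes x = α/β − λ (a symbol introduced here only for brevity), so that x = 0 at α = λβ:

```latex
\begin{align*}
\text{(A3):}\quad
  & e^{-\lambda} + \lambda = \frac{1}{\beta}. \\[4pt]
\text{(A1), case } \alpha \le \lambda\beta:\quad
  & e^{-\alpha/\beta} + \frac{\alpha}{\beta}
    \;=\; e^{-\lambda} + \lambda \;=\; \frac{1}{\beta}. \\[4pt]
\text{(A1), case } \alpha \ge \lambda\beta,\ x = 0:\quad
  & \frac{1}{\beta}
    + \frac{1}{4}\,(e^{0} - 1)\Bigl(3 - \frac{2}{\beta} + 2\lambda\Bigr)
    - \frac{1}{2}\cdot 0
    \;=\; \frac{1}{\beta}. \\[4pt]
\text{(A2), case } \alpha \le \lambda\beta:\quad
  & 1 + \frac{\alpha}{2\beta} \;=\; 1 + \frac{\lambda}{2}. \\[4pt]
\text{(A2), case } \alpha \ge \lambda\beta,\ x = 0:\quad
  & 1 + \frac{\beta}{8\alpha}\,(e^{0} - 1 - 0)\Bigl(3 - \frac{2}{\beta} + 2\lambda\Bigr)
    + \frac{1}{4}(\lambda + \lambda)
    + \frac{\lambda}{4}(1 - 1)
    \;=\; 1 + \frac{\lambda}{2}.
\end{align*}
```

Only the unsuccessful-search check needs (A3); the successful-search cases agree at the crossover by substitution alone.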

The procedure used in Sec. 4 to optimize the number
of probes per search for a fixed load factor α makes use
of the fact that the optimum address factors β_opt which
minimize (A1) and (A2) are located somewhere in the
"α ≥ λβ" region.

In other words, the best address region size M is
always large enough so that on the average some amount
of coalescing will occur, as we saw in Sec. 1. This can be
proved as follows. Inspection of (A3), which relates β
and λ, shows that if β increases, then λ decreases, and
vice versa. The derivative w.r.t. λ of the expression λβ
= λ/(e^{-λ} + λ) is positive, so β increases when λβ
decreases, and vice versa. This means that the value of β at
the cutoff point α = λβ is the largest value of β in the "α
≤ λβ" region. (The load factor α is fixed in this analysis.)
Since the derivatives w.r.t. β of C'_N(M', M) and C_N(M',
M) are both negative when α ≤ λβ, a decrease in β
would increase C'_N(M', M) and C_N(M', M). Therefore,
the minima for the "α ≤ λβ" case both occur at the
largest possible value for β, namely, the endpoint where
α = λβ, which is also part of the "α ≥ λβ" region. It also
turns out that both formulas (A1) and (A2) are convex
w.r.t. β, which is also needed for the optimization process
in Sec. 4.

Fig. 10. The average number of probes per search used by Algorithm
C in a full table, as a function of the address factor β.

MIX Running Times

The MIX running time for unsuccessful searches,
which we get by substituting S = S1 = 0 into (2), is given
by

$$
(7C + 4A + 17)u
\tag{A4}
$$



where u is the standard unit of time for MIX computers.
The average value of C is C'_N(M', M), which is calculated
in (A1). It is convex w.r.t. β and achieves its minimum
somewhere in the "α ≥ λβ" range. The expected value
of A is equal to 1 - E_N/M, where E_N is defined to be the
average number of empty slots in the address region
after N records have been inserted. We have

$$
E_N \sim
\begin{cases}
e^{-\alpha/\beta}\, M & \text{if } \alpha \le \lambda\beta \\[6pt]
\dfrac{1 - \alpha}{\beta}\, M & \text{if } \alpha \ge \lambda\beta
\end{cases}
$$



where λ is defined by (A3). The average value of A
achieves its minimum w.r.t. β at the cutoff point α =
λβ, where its derivative w.r.t. β is discontinuous. The
derivative is negative when α < λβ and positive when α
> λβ. Since β and λβ vary in opposite directions, this
shows that (A4) is minimized at some point β in the "α
≥ λβ" region. (In fact, for α < 0.5, the minimum occurs
at the endpoint α = λβ.) The optimization in Sec. 4
depends on the empirically verified fact that for each
fixed value of α, formula (A4) is well-behaved. By that
we mean that (A4) is minimized at a unique β_opt, which
occurs either at the endpoint α = λβ or at the unique
point in the "α ≥ λβ" region where the derivative is 0.

The value of A is always 1 in successful searches,
since only occupied slots are probed. Substituting A = 1
and S = 1 into (2), we obtain the following formula for
the MIX successful search time:

$$
(7C + 18 + 2\,S1)u
\tag{A5}
$$






Here the average value of C is C_N(M', M), which is given
in (A2). It is convex w.r.t. β and is minimized at some
point in the "α ≥ λβ" range. The expected value of S1
is the average number of records that do not collide when
inserted, divided by N. That is equal to

    [expression lost in extraction]    if α ≤ λβ

    [expression lost in extraction]    if α ≥ λβ

where λ is defined by (A3). The expected value of S1
attains its maximum in the "α ≥ λβ" region. This gives
us no clue as to whether (A5) is well-behaved in the
above sense, which is assumed by the optimization
procedure, but numerical study verifies that it is true.

Addendum: Since the time this article was submitted, 
one of the open problems mentioned in Sec. 7 has been 
solved by Wen-Chin Chen and the author. The analyses 
of the search times of early-insertion and varied-insertion 
coalesced hashing appear in "Analysis of Some New 
Variants of Coalesced Hashing," Department of Com- 
puter Science, Brown University, Technical Report No. 
CS-82-18, June 1982.

ABSTRACT. When open addressing is used to resolve collisions in a hash table, a given set of keys may be
arranged in many ways; typically this depends on the order in which the keys are inserted. It is shown that
arrangements minimizing either the average or worst-case number of probes required to retrieve any key in
the table can be found using an algorithm for the assignment problem. The worst-case retrieval time can be
reduced to O(log_2(M)) with probability 1 - ε(M) when storing M keys in a table of size M, where ε(M) → 0
as M → ∞. We also examine insertion algorithms to see how to apply these ideas for a dynamically changing
set of keys.

KEY WORDS AND PHRASES: hashing, collision resolution, searching, assignment problem, optimal algorithms,
database organization

"Spread the table and contention will cease." Old English proverb [11, #272 6]
1. Introduction 

We consider schemes to optimize the placement of keys in a hash table when open 
addressing is used to resolve collisions. More precisely, we begin with the observation
that a given set of keys may be inserted into a hash table in many different orders, 
yielding arrangements of the keys in the table of varying efficiency. Typically, the user 
has no control over the order in which the keys are inserted; he must accept them in 
the order in which they arrive. However, the previous observation that there exist 
many different arrangements of the given set of keys raises the following questions: 

(1) How can one determine that arrangement which minimizes either the average or 
worst-case number of probes to retrieve a key in the table? In Section 2 we show that 
this problem is an instance of the well-known "assignment problem," for which efficient 
algorithms exist. 

(2) What is the expected value of the worst-case number of probes required to
retrieve a key from a full table that has been optimally arranged using the assignment
algorithm? In Section 3 it is proved that this value is O(log_2(M)) for a table of size M
containing M keys. The proof is modeled on a result by Erdős and Rényi [2] concerning
the permanent of a random matrix. This result demonstrates that we can
use hashing to achieve "good" (i.e., O(log_2(M))) worst-case performance if we take the
time to optimize the arrangement of the keys in the table. Traditionally hashing has
been viewed as excellent on the average, but horrible in the worst case. We see
therefore that this need not be so.

(3) The results mentioned above require that an M×M assignment problem be
solved to optimize the placement of M keys in a table of size M. A natural question to
ask is, "Is it possible to solve the assignment problem efficiently 'incrementally,' so that
the new keys can be added to the table in such a way that the optimality of the
overall arrangement is maintained?" In Section 4 this problem is studied and it is
shown that for table densities less than approximately 0.415, it is possible to insert a
key and maintain overall optimality by solving an assignment problem no larger than
10×10, whereas for larger densities the entire M×M assignment problem must
apparently be solved.

Overall, we view the contribution of this paper to be the introduction of the 
assignment algorithm for the placement of keys in a hash table, and the demonstration 
that efficient worst-case retrieval can be achieved thereby, even in a full table. 

We proceed now to define our terminology and to introduce the "standard" algorithm
for inserting a key into a hash table. Let 𝒦 = {K_1, K_2, ..., K_N} be a set of N keys, and
let an array T_i, for 1 ≤ i ≤ M, be a set of M memory locations (the hash table) which
will be used to store 𝒦. Each table position may hold either a single key or the special
symbol empty. We assume N ≤ M. When open addressing is used to resolve collisions a
"hashing function" h: U × {1, 2, ..., M} → {1, 2, ..., M} is used, mapping the set U of
all possible keys (that is, 𝒦 may be any N-subset of U) and probe numbers into the set
of memory locations. We assume for any key K ∈ U that the sequence h(K, 1), h(K, 2),
..., h(K, M) is a permutation of {1, 2, ..., M}. To store the key K in the table using
the standard insertion algorithm the locations T_{h(K,1)}, T_{h(K,2)}, ... are successively examined
until an empty location is found or until K is found already present in the table. The
following program makes this precise.

THE "STANDARD" INSERTION ALGORITHM 

Input: A key K, a hash table T, a hash function h. 

Output: None. T is modified to contain K, unless K is already present. 

Procedure: 

j := 0; 

repeat j := j + 1; 

i := h(K, j); 

if T_i = empty then T_i := K 

until T_i = K; 

Note that T must contain at least one empty location if K is not already in the table, 
if the loop is to terminate properly. The value of j at termination, which is the number 
of probes required to insert K, is taken to be the cost of inserting K. 

A similar procedure searches for the presence of a key K in T (replace the assignment 
statement "T_i := K" by "return (K not present)"). If the repeat loop terminates 
normally then T_i contains the previously stored key K. The value of j at termination is 
taken to be the cost of searching for K. 
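The insertion and search procedures above can be sketched in Python. This is an illustrative sketch only: the names `EMPTY`, `digit_hash`, `insert`, and `search` are assumptions, and the hash function (the j-th decimal digit of the key) is borrowed from the example in Section 2 rather than prescribed here.

```python
EMPTY = None  # stands in for the special symbol "empty"

def digit_hash(key, j):
    """Illustrative h(K, j): the j-th decimal digit of K (1-based)."""
    return int(str(key)[j - 1])

def insert(table, key, h):
    """Standard insertion; returns the number of probes used (the cost).

    `table` maps locations 1..M to a key or EMPTY. As noted in the text,
    the table needs an empty location unless the key is already present.
    """
    j = 0
    while True:
        j += 1
        i = h(key, j)
        if table[i] is EMPTY:
            table[i] = key
        if table[i] == key:
            return j

def search(table, key, h):
    """Returns the probe count if the key is present, or None if absent."""
    j = 0
    while True:
        j += 1
        i = h(key, j)
        if table[i] is EMPTY:
            return None  # "K not present"
        if table[i] == key:
            return j

table = {loc: EMPTY for loc in range(1, 5)}
costs = [insert(table, k, digit_hash) for k in (1423, 1234, 3412, 2341)]
```

Here `costs` records the insertion cost of each key in turn; searching for a stored key afterwards returns the same probe count that was paid to insert it.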

Knuth [6] studies hashing algorithms in detail, giving alternative methods for handling 
"collisions" (the case when h(K_i, 1) = h(K_j, 1) for K_i ≠ K_j) and several open-addressing 
hash functions h. The reader who is unfamiliar with hashing algorithms should find it 
profitable to consult his text. 

2 Optimal Arrangements 

In this section we give precise definitions of when an arrangement minimizes the 
average or worst-case retrieval time, and then show that there always exists some 
ordering such that if the keys had been inserted by the standard algorithm in that 
order, the optimal arrangement results. Then it is shown that the assignment algorithm 
can be used to arrange the keys so as to minimize either the average or worst-case 
retrieval time. 

The arrangement of the keys X in the hash table depends on the order in which they 
were inserted, if the standard insertion algorithm is used. For example, let U be the set 
of natural numbers and let h(K, j) be the jth decimal digit of K. Inserting the set X = 
{1423, 1234, 3412, 2341} into an empty table in that order results in the arrangement 
a: 

Location:  1     2     3     4 

Contents:  1423  1234  3412  2341 

whereas inserting them in the order 1234, 2341, 1423, 3412 results in a': 

Location:  1     2     3     4 

Contents:  1234  2341  3412  1423 

Let a: X → {1, 2, ..., M} be called an arrangement; a(K_i) = j means that T_j = K_i. Of 
course a must be one-to-one. Let A(X, M) denote the set of all arrangements of X in T. 

Let p(K, a) denote the number of probes required to retrieve a key K under 
arrangement a; the average avg(a) = (1/N) Σ_{K ∈ X} p(K, a) and worst-case wc(a) = 
max{p(K, a) | K ∈ X} number of probes to retrieve any key in T are then definable. We 
have avg(a) = 7/4, wc(a) = 3, avg(a') = 5/4, and wc(a') = 2 in the above examples. 
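The two example arrangements and their costs can be reproduced with a short Python sketch. The helper names (`h`, `arrange`, `probes`, `avg`, `wc`) are illustrative assumptions; only the digit hash function and the two insertion orders come from the text.

```python
from fractions import Fraction

def h(key, j):
    """h(K, j): the j-th decimal digit of K, as in the example."""
    return int(str(key)[j - 1])

def arrange(keys):
    """Insert keys in the given order with the standard algorithm.

    Returns the arrangement as a dict {key: location}.
    """
    table = {}  # location -> key
    for key in keys:
        j = 1
        while h(key, j) in table:
            j += 1
        table[h(key, j)] = key
    return {key: loc for loc, key in table.items()}

def probes(key, a):
    """p(K, a): the probe number j at which h(K, j) = a(K)."""
    j = 1
    while h(key, j) != a[key]:
        j += 1
    return j

def avg(a):
    return Fraction(sum(probes(k, a) for k in a), len(a))

def wc(a):
    return max(probes(k, a) for k in a)

a1 = arrange([1423, 1234, 3412, 2341])
a2 = arrange([1234, 2341, 1423, 3412])
print(avg(a1), wc(a1))  # 7/4 3
print(avg(a2), wc(a2))  # 5/4 2
```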

Define an arrangement a ∈ A(X, M) to be valid if all the positions h(K, 1), h(K, 2), 
..., h(K, p(K, a) - 1) are nonempty for every key K in X. An arrangement is valid iff 
every key K in X is retrievable using the search algorithm of Section 1. Similarly define 
an arrangement to be feasible if it is the result of inserting the keys in X into an empty 
table sequentially in some order; necessarily every feasible arrangement is valid. 

Valid arrangements which are not feasible are possible; consider the following 
arrangement using the hash function h from our previous example: 

Location:  1      2      3     4 

Contents:  empty  empty  4321  3412 
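The distinction between valid and feasible can be checked mechanically for this example. The following Python sketch tests validity directly from the definition and tests feasibility by brute force over all insertion orders; the function names are assumptions, and the exhaustive search is only practical for very small key sets.

```python
from itertools import permutations

def h(key, j):
    """h(K, j): the j-th decimal digit of K, as in the example."""
    return int(str(key)[j - 1])

def probes(key, a):
    """p(K, a): the probe number j at which h(K, j) = a(K)."""
    j = 1
    while h(key, j) != a[key]:
        j += 1
    return j

def is_valid(a):
    """Every position probed before reaching a key must be occupied."""
    occupied = set(a.values())
    return all(h(k, j) in occupied
               for k in a for j in range(1, probes(k, a)))

def standard_arrangement(order):
    """Arrangement produced by standard insertion in the given order."""
    table = {}
    for key in order:
        j = 1
        while h(key, j) in table:
            j += 1
        table[h(key, j)] = key
    return {key: loc for loc, key in table.items()}

def is_feasible(a):
    """True if some insertion order reproduces the arrangement."""
    return any(standard_arrangement(order) == a for order in permutations(a))

a = {4321: 3, 3412: 4}
print(is_valid(a), is_feasible(a))  # True False
```

Each key blocks the first probe of the other, so both are retrievable, yet neither could have been inserted first.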

The number of feasible arrangements depends on X and h. It is no larger than N! (the 
number of ways to enter the keys), but may be as low as 1 if no collisions occur. 
Similarly the number of valid arrangements can vary between 1 and N!. For example, 
only one valid arrangement exists if no collisions occur and h(K_i, 1) ≠ h(K_j, 2) for all 
K_i, K_j in X. The upper bound of N! on the number of valid arrangements is obtained 
by induction on N, using the fact that p(K, a) ≤ N for any valid arrangement and all 
keys K ∈ X. We may store K_N in any of N positions h(K_N, i) for 1 ≤ i ≤ N; if we then 
delete K_N from X and h(K_N, i) from the probe sequence h(K_j, 1), ..., h(K_j, M) for 
every j < N we see that every valid arrangement of X induces a valid arrangement of 
X - {K_N} in locations {j | 1 ≤ j ≤ M and j ≠ h(K_N, i)} using the modified probe sequences. 

We define an arrangement a ∈ A(X, M) to be optimal if either avg(a) or wc(a) is 
minimal over all arrangements in A(X, M); the terms average-optimal and worst-case- 
optimal will distinguish these cases. 

Proposition 1 . A feasible optimal arrangement always exists. 

Proof. If a minimal arrangement a is not feasible, then there exists a set {K_{i_0}, K_{i_1}, 
..., K_{i_{r-1}}} of keys, none of which can be entered first since they form a "blocking 
cycle": There is a set of integers t_j for 0 ≤ j ≤ r - 1 such that h(K_{i_j}, p(K_{i_j}, a)) = 
h(K_{i_{(j+1) mod r}}, t_{(j+1) mod r}) and t_j < p(K_{i_j}, a) for 0 ≤ j ≤ r - 1. But clearly p(K_{i_j}, a) can be 
reduced by setting a(K_{i_j}) to h(K_{i_j}, t_j) for 0 ≤ j ≤ r - 1. Since avg(a) strictly decreases, 
a feasible optimal arrangement can always be found after a finite number of blocking 
cycles have been removed in this fashion. □ 

Proposition 1 suggests an algorithm for finding optimal arrangements: enumerating 
all feasible arrangements; however, better methods exist. 

Proposition 2. Optimal arrangements can be found by using an algorithm for the 
assignment problem. 

Proof. The assignment problem [7] can be stated as follows. 

Let N and M be given, with N ≤ M, and let {a_ij | 1 ≤ i ≤ N, 1 ≤ j ≤ M} be a matrix of 
nonnegative real numbers. The classic example specifies for each of M men and N jobs, 
the "inefficiency" a_ij of man j in job i. The objective is to find an assignment i → a(i) of 
jobs to men such that the sum Σ_{1 ≤ i ≤ N} a_{i,a(i)} is minimized, subject to the constraint that 
no man is assigned to more than one job. 

We can apply this directly to the problem of finding average-optimal arrangements 
by letting a_ij be the integer such that h(K_i, a_ij) = j, denoting the cost of assigning K_i to 
T_j. The average number of probes required to retrieve a key in the optimized table is 
then just the total "inefficiency" divided by N. We observe that if the various keys have 
associated retrieval probabilities, then the arrangement that minimizes the expected 
retrieval cost can be found in the same manner; we need only multiply each a_ij by the 
probability that K_i will be retrieved. 

Similarly, we can minimize the worst-case cost by choosing a_ij to be N^l, where l is the 
integer such that h(K_i, l) = j. Since the key with highest cost determines the order of 
the total cost, minimizing the total cost here minimizes the worst-case cost. □ 
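Both cost matrices of Proposition 2 can be built and solved for the four-key example of this section. The Python sketch below uses exhaustive search as a stand-in for a real assignment-problem solver, so it only suits tiny inputs; `cost_matrix`, `solve_assignment`, and the `weight` parameter are illustrative assumptions.

```python
from itertools import permutations

def h(key, j):
    """h(K, j): the j-th decimal digit of K, as in the example."""
    return int(str(key)[j - 1])

def cost_matrix(keys, m, weight=lambda l: l):
    """a[i][j-1] = weight(l), where l is the probe number with h(K_i, l) = j."""
    rows = []
    for key in keys:
        row = [0] * m
        for l in range(1, m + 1):
            row[h(key, l) - 1] = weight(l)
        rows.append(row)
    return rows

def solve_assignment(a):
    """Exhaustive minimum-cost assignment of rows to distinct columns."""
    n, m = len(a), len(a[0])
    return min(permutations(range(m), n),
               key=lambda cols: sum(a[i][c] for i, c in enumerate(cols)))

keys = [1423, 1234, 3412, 2341]
m, n = 4, len(keys)
linear = cost_matrix(keys, m)

# Average-optimal: the cost of a cell is the probe number itself.
avg_cols = solve_assignment(linear)
avg_total = sum(linear[i][c] for i, c in enumerate(avg_cols))

# Worst-case-optimal: the cost of a cell is N**l, so one long probe dominates.
wc_cols = solve_assignment(cost_matrix(keys, m, weight=lambda l: n ** l))
wc_probes = max(linear[i][c] for i, c in enumerate(wc_cols))

print(avg_total, wc_probes)  # 5 2
```

The total of 5 probes matches avg(a') = 5/4, and the worst-case value of 2 matches wc(a'), so the arrangement a' from the earlier example is optimal in both senses.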

Having observed that our problem can be formulated as an instance of the assignment 
problem, it is of interest to know how quickly a solution can be determined. The 
general N x M assignment problem can be solved in time O(N^2 M) [8]; the space 
required is O(N + M) if the matrix entries a_ij can be computed in constant time from 
K_i, h, and j. When all the matrix entries are small integers (as when we are finding the 
average-optimal arrangement), it may be possible to improve this time bound somewhat, 
but the author was unable to find a more efficient procedure. 

Worst-case optimal arrangements can be determined in time O(BM(M, N) · log_2(M)), 
where BM(M, N) is the time required to solve an M x N bipartite matching problem. 
The procedure, pointed out to the author by Vuillemin, is to use binary search on the 
worst-case cost: It is possible to test if the optimal worst-case cost is less than or equal 
to a given value w by solving the corresponding maximal matching problem. The graph 
used has N vertices x_i, M vertices y_j, and an edge (x_i, y_j) iff a_ij ≤ w. Intuitively, there is 
an edge from x_i to y_j if and only if table position T_j is one of the first w positions in the 
probe sequence for K_i. There will be a matching of size N in this graph if and only if 
there is an arrangement of the keys in the table such that every key can be retrieved 
with no more than w probes. Since BM(M, M) = O(M^2.5), we obtain an O(M^2.5 log(M)) 
algorithm for the case N = M. 
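The binary-search procedure can be sketched as follows. For brevity this Python sketch uses a simple augmenting-path matching routine rather than the faster matching algorithm behind the stated time bound, so it illustrates the structure of the method but not its running time. The function names are assumptions, and N ≤ M is assumed so that w = M always succeeds.

```python
def h(key, j):
    """h(K, j): the j-th decimal digit of K, as in the earlier example."""
    return int(str(key)[j - 1])

def max_matching(adj):
    """Size of a maximum matching; adj[i] lists the locations usable by key i."""
    owner = {}  # location -> index of the key currently matched to it

    def try_assign(i, seen):
        for loc in adj[i]:
            if loc in seen:
                continue
            seen.add(loc)
            # Take a free location, or evict its owner if the owner can move.
            if loc not in owner or try_assign(owner[loc], seen):
                owner[loc] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(adj)))

def fits(keys, w):
    """Can every key be stored within its first w probe positions?"""
    adj = [[h(k, l) for l in range(1, w + 1)] for k in keys]
    return max_matching(adj) == len(keys)

def optimal_worst_case(keys, m):
    """Smallest w for which an arrangement with worst-case cost w exists."""
    lo, hi = 1, m
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(keys, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(optimal_worst_case([1423, 1234, 3412, 2341], 4))  # 2
```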

3 Efficiency of the Worst-Case Optimal Arrangements 

In this section we prove that even if the hash table is full (N = M), we can expect the 
worst-case optimal arrangement to have a worst-case cost of O(log(M)) with a probability 
approaching one very rapidly as M → ∞. Although a worst-case cost of O(log(M)) can 
obviously not be guaranteed (since there is a finite chance that all keys have the same 
probe sequence, for example), the odds are overwhelming that with a random hash 
function and a random set of keys, there is some arrangement of those keys yielding a 
worst-case cost of O(log(M)). This compares favorably with standard techniques such as 
binary search trees which also require O(log_2(M)) time to retrieve a key, especially in 
situations where the set of keys is static (since updating an optimized hash table can be 
expensive). 

The proof is modeled very closely after a similar result of Erdos and Renyi [2], who 
show that a random n x n matrix of 0's and 1's containing N(n) 1's has a nonzero 
permanent with probability approaching 1 as n → ∞ if lim_{n→∞} (N(n) - n log(n))/n = ∞. 
The permanent of an n x n matrix {a_ij} is defined to be Σ a_{1,i_1} a_{2,i_2} ... a_{n,i_n}, where the 
summation is over all permutations (i_1, ..., i_n) of {1, ..., n}. The permanent of a 0-1 
matrix {a_ij} is the number of matchings of size n in a bipartite graph whose adjacency 
matrix is {a_ij}. Ryser [10] discusses the permanent in some detail. 
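As a small illustration of the definition, the permanent can be computed directly from the sum over permutations. The Python sketch below is exponential in n and is meant only to show the correspondence with matchings; the example matrix is an assumption built from the four keys of Section 2 with each key restricted to its first two probe positions.

```python
from itertools import permutations

def permanent(a):
    """Sum over all permutations of the products a[0][p0] * a[1][p1] * ..."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        product = 1
        for i, j in enumerate(perm):
            product *= a[i][j]
        total += product
    return total

# Rows: keys 1423, 1234, 3412, 2341; columns: locations 1..4.
# Entry is 1 iff the location is among the key's first two probes.
usable = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
print(permanent(usable))  # 2
```

The value 2 says there are exactly two ways to place all four keys so that none needs more than two probes.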

Let ℳ(M, N, w) denote the set of all 0-1 matrices with M columns, N rows, and 
exactly w 1's per row. Obviously |ℳ(M, N, w)| = C(M, w)^N, where C(M, w) is the 
binomial coefficient. We say a matrix {m_ij} ∈ ℳ(M, N, w) contains N independent 1's 
iff there exists a function a: {1, ..., N} → {1, ..., M} 
such that a(i) ≠ a(j) for i ≠ j and m_{i,a(i)} = 1 for 1 ≤ i ≤ N. Let P(M, N, w) denote the 
probability that a matrix in ℳ(M, N, w) contains N independent ones. 

The interpretation of the matrices of ℳ(M, N, w) is as follows. Each such matrix has N 
rows (corresponding to a set of N keys) and M columns (one for each position in the 
hash table). Position i, j will be a 1 iff key i can be stored in position j with a retrieval 
cost of w or less. Therefore each row has exactly w 1's. Such a matrix is the adjacency 
matrix of one of the bipartite graphs described in the last paragraph of Section 2. A 
matrix in ℳ(M, N, w) will have N independent ones iff its corresponding bipartite 
graph has a matching of size N. This will happen iff there exists an arrangement of the 
keys so that every one can be retrieved with w probes or less. 

We identify P(M, N, w) with the probability that a random set of N keys can be 
arranged in a hash table of size M so that the worst-case retrieval cost is at most w. This 
will be accurate if every set of w locations is equally likely to be the set of w locations 
first probed for a random key K. This will happen, for example, if every permutation of 
{1, ..., M} is equally likely to be a probe sequence. Each matrix in ℳ(M, N, w) then 
corresponds in a natural fashion to the characteristic matrix describing, for a random 
set of N keys, which locations are usable if the worst-case cost is constrained to be at 
most w. The existence of N independent 1's corresponds to the existence of an 
arrangement with worst-case cost of at most w; and by Proposition 1 the existence of a 
feasible, valid arrangement with worst-case cost at most w is thereby implied. 
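Under the uniformity assumption just stated, P(M, N, w) can be estimated by sampling. The following Python sketch draws w distinct usable locations per key uniformly at random and checks for a matching of size N with a simple augmenting-path routine; the names `has_full_matching` and `estimate_P`, the trial count, and the seed are all illustrative assumptions, and the result is a Monte Carlo estimate rather than an exact probability.

```python
import random

def has_full_matching(rows):
    """rows[i] lists the usable locations for key i; True if all keys fit."""
    owner = {}  # location -> index of the key currently matched to it

    def place(i, seen):
        for loc in rows[i]:
            if loc in seen:
                continue
            seen.add(loc)
            if loc not in owner or place(owner[loc], seen):
                owner[loc] = i
                return True
        return False

    return all(place(i, set()) for i in range(len(rows)))

def estimate_P(m, n, w, trials=2000, seed=0):
    """Fraction of random matrices in M(m, n, w) with n independent 1's."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rows = [rng.sample(range(1, m + 1), w) for _ in range(n)]
        hits += has_full_matching(rows)
    return hits / trials

# Estimates for a full table of size 16 as w grows.
for w in (1, 2, 4, 8):
    print(w, estimate_P(16, 16, w))
```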

We have P(M, N, w) ≥ P(M, M, w) for 1 ≤ N ≤ M since the first N rows of a matrix 
in ℳ(M, M, w) which contains M independent 1's must contain N independent 1's. We 
therefore proceed to show the following. 

Proposition 3. lim_{M→∞} P(M, M, 4 log(M)) = 1. 

Proof. This result says that we can expect to find an arrangement of M keys in a 
table of size M such that no key requires more than 4 log(M) probes to be retrieved. By 
the theorems of Frobenius [3] and Konig [7], 1 - P(M, M, w) is equal to the 
probability that a matrix in ℳ(M, M, w) has k rows (or columns) and M - k - 1 
columns (or rows) that contain all the 1's, for some k, 0 ≤ k ≤ M - 1. (The result of 
Frobenius and Konig says that in an M x M matrix of 0's and 1's the minimal number of 
lines (i.e. rows or columns) which contain all the 1's is equal to the size of the 
maximum set of 1's which can be found which are pairwise independent (no two in the 
same line).) Thus 1 - P(M, M, w) is the probability that there are M - 1 or fewer lines 
which contain all the 1's. 

Let Q_k(M, N, w) denote the probability that a matrix in ℳ(M, N, w) has k rows (or 
columns) and N - k - 1 columns (or rows) containing all the 1's, and k is the least 
such number for 0 ≤ k ≤ M/2. Then 

\[ 1 - P(M, N, w) = \sum_{0 \le k \le \lfloor M/2 \rfloor} Q_k(M, N, w). \]

We show that for all k, 0 ≤ k ≤ ⌊M/2⌋, if w ≥ 4 log_2(M) then Q_k(M, M, w) → 0. To 
do this we divide Q_k into two parts, 

\[ Q_k(M, M, w) = f_k(M, M, w) + g_k(M, M, w), \]

where f_k is the probability that k rows and M - k - 1 columns cover all the 1's and g_k 
is the probability that k columns and M - k - 1 rows cover all the 1's (k in each case 
being minimal). 

Case 1. k rows and M - k - 1 columns contain all the 1's, for some k ≤ M/2. 
Those matrices in ℳ(M, M, w) having a minimal number k of rows and M - k - 1 
columns containing all the 1's can be displayed as in Figure 1, after an appropriate 
permutation of the rows and columns. Each row of submatrix B must contain two 1's 
under our assumption that k is minimal (if not, we could include the column, and 
exclude the row, of the 1 in matrix B which is in a row of B containing no other 1's). The 
fraction f_k(M, M, w) of matrices of this type is less than 

\[ \binom{M}{k} \binom{M}{k+1} \left[ \binom{M-k-1}{w} \Big/ \binom{M}{w} \right]^{M-k} \]

whose logarithm is bounded above by 

\[ [(2k + 1) - w(M - k)] \log(M) + w(M - k) \log(M - k - 1) - k \log(k) - (k + 1) \log(k + 1) < (2k + 1) \log(M) - w(k + 1)/2. \]

Thus if w ≥ 4 log(M), f_k(M, M, w) → 0 as M → ∞. 

Case 2. k columns and M - k - 1 rows contain all the 1's, for some k ≤ M/2 
(Figure 2). 

The fraction g_k(M, M, w) of matrices of this type is less than 



(?)U)(*rt) 



whose logarithm is bounded above by 

(2k + 1) log(M) - w(k + 1) log(w), 

so that g_k(M, M, w) → 0 with M if w = 2 log(M). Since Q_k(M, M, w) = f_k(M, M, w) + 
g_k(M, M, w), we are finished with the proof. □ 

This result says that in a full table arranged so as to minimize the worst-case retrieval 
time, the worst-case retrieval time should be O(log(M)). This follows from Proposition 
3 since the existence of a set of M independent 1's in a matrix in ℳ(M, M, w) 
corresponds to an arrangement of M keys in a table of size M with worst-case retrieval 
time no more than w. This result is the best possible (up to a constant multiplicative 
factor) due to a result of Gonnet [4]: The worst-case retrieval time must be at least 
ln(M) + O(1). 

A study of the related question of the expected value of the average number of 
probes required to retrieve a key in a full table which is average-optimal is given in [5]. 
(Less than two probes per key are required.) 

4. Insertion Algorithms Which Maintain Optimality 

We now turn our attention to the problem of maintaining the optimality of an 
arrangement as new keys are inserted into a table. The main result of this section is 
that if the table is not too densely filled, then a new key can be inserted into the table 
and the new optimal arrangement computed by solving a small (e.g. 10 x 10) assignment 
problem. This result is obtained by a rather complicated analysis using generating 
functions. 

We first examine an insertion algorithm due to Brent [1] and demonstrate that it 
does not maintain optimality. Of course, Brent only intended his algorithm to be a 
good heuristic, a means of inserting each new key in such a fashion that the increase in 
average retrieval cost is kept reasonably low. 

Brent's algorithm works as follows. Let K denote the new key being inserted, and 
suppose positions h(K, 1), ..., h(K, s) are already occupied with keys K_1, K_2, ..., K_s, 
and that T_{h(K, s+1)} is empty. Let r_i denote the number of probes required to retrieve K_i, 
so that h(K_i, r_i) = h(K, i). Furthermore, let s_i denote min{j | T_{h(K_i, j)} = empty}, the 
number of probes required to retrieve K_i if we move it to position h(K_i, s_i). Then (i + 
(s_i - r_i))/(N + 1) is the increase in the average retrieval cost caused by moving K_i to 
position h(K_i, s_i) and storing K in position h(K, i). Brent chooses between storing K in 
position h(K, s + 1) and moving that K_i which minimizes i + (s_i - r_i) by comparing (s + 
1) to min_i{i + s_i - r_i}. 
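The choice described above can be sketched in Python. This follows only the description given in this section, not Brent's original listing; the function names are assumptions, the hash function is the decimal-digit example used earlier, and the table is assumed to contain at least one empty location.

```python
EMPTY = None  # stands in for the special symbol "empty"

def h(key, j):
    """h(K, j): the j-th decimal digit of K, as in the earlier examples."""
    return int(str(key)[j - 1])

def first_empty_probe(table, key):
    """Smallest j such that location h(key, j) is empty."""
    j = 1
    while table[h(key, j)] is not EMPTY:
        j += 1
    return j

def probes_to(key, loc):
    """Probe number at which `key` reaches location `loc`."""
    j = 1
    while h(key, j) != loc:
        j += 1
    return j

def brent_insert(table, key):
    """Insert `key`, moving at most one resident key forward.

    Returns the increase in the total number of retrieval probes.
    """
    s = first_empty_probe(table, key) - 1     # h(K,1..s) are occupied
    best_cost, best_i = s + 1, None           # default: store K at h(K, s+1)
    for i in range(1, s + 1):
        resident = table[h(key, i)]
        r_i = probes_to(resident, h(key, i))
        s_i = first_empty_probe(table, resident)
        if i + s_i - r_i < best_cost:
            best_cost, best_i = i + s_i - r_i, i
    if best_i is None:
        table[h(key, s + 1)] = key
    else:
        loc = h(key, best_i)
        resident = table[loc]
        table[h(resident, first_empty_probe(table, resident))] = resident
        table[loc] = key
    return best_cost

table = {1: 1324, 2: 2134, 3: EMPTY, 4: EMPTY}
increase = brent_insert(table, 1243)  # moves 1324 to location 3
```

In this small table, storing 1243 at its third probe would cost 3 probes, while moving 1324 forward by one probe and taking its place costs only 2 in total.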

In fact, the following example demonstrates that no algorithm which only moves keys 
forward in their probe sequence (that is, moves K from h(K, i) to h(K, i') for i' > i) can 
always arrive at the optimal arrangement. Consider the following arrangement (using 
the hash function of our previous examples), which is both average and worst-case 
optimal: 

Location:  1        2        3        4        5        6        7 

Contents:  1273456  1234567  3456712  4567123  5671234  6712345  empty 

If the key 2345671 is now inserted, the only way to maintain optimality is to move 
1273456 to location 7, move 1234567 (backward) to position 1, and then store 
2345671 in position 2. 
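This example is small enough to check exhaustively. The Python sketch below compares the total number of probes after a plain standard insertion (old keys left in place) with the minimum over all 7! placements of the seven keys; the helper names are assumptions, and `probe_index` relies on the digit hash function, under which the probe number of a location is the position of that digit in the key.

```python
from itertools import permutations

def probe_index(key, loc):
    """Probes needed to reach `key` when it is stored at location `loc`."""
    return str(key).index(str(loc)) + 1

old_keys = [1273456, 1234567, 3456712, 4567123, 5671234, 6712345]
new_key = 2345671
keys = old_keys + [new_key]

def total(assignment):
    """Total probes when keys[i] is stored at assignment[i]."""
    return sum(probe_index(k, loc) for k, loc in zip(keys, assignment))

# Minimum total over every one-to-one placement of the seven keys.
best = min(permutations(range(1, 8)), key=total)
best_arrangement = dict(zip(keys, best))

# Leaving the old keys at locations 1..6 and putting the new key at 7.
standard = dict(zip(old_keys, range(1, 7)))
standard[new_key] = 7
standard_total = sum(probe_index(k, loc) for k, loc in standard.items())

print(standard_total, total(best))  # 13 9
print(best_arrangement[1273456], best_arrangement[1234567],
      best_arrangement[2345671])    # 7 1 2
```

The optimum moves 1234567 from its second probe back to its first, which no forward-only scheme can do.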

Since Brent's algorithm is the only published algorithm which moves previously 
inserted keys when inserting a new key, we see that no existing insertion algorithm can 
maintain optimality for arbitrary hash functions. It is interesting to note, however, that 
for certain open-addressing collision-resolution schemes the standard insertion algorithm 
maintains average-optimality. We say that a hash function h exhibits primary clustering 
if h(K_i, j) = h(K_i', j') implies that h(K_i, j + l) = h(K_i', j' + l) for 0 ≤ l ≤ M - min(j, j') 
for any K_i, K_i'. Linear probing (h(K, i) ≡ h(K, 1) + (i - 1), mod M) is perhaps the 
best-known example of a collision-resolution scheme exhibiting primary clustering, and 
all primary clustering schemes are in fact isomorphic to linear probing in a natural 
manner. 

Proposition 4. If h exhibits primary clustering, then the usual insertion algorithm 
maintains average-optimality. 

Proof. This theorem is due to Peterson [9]; the proof is also given in Knuth [6, p. 
531]. Knuth also remarks that if the keys have associated retrieval probabilities, then 
the average-optimal arrangement can be achieved by using the standard insertion 
routine to insert the keys one by one into the table, in order of decreasing request 
probabilities. □ 

In spite of the fact that for linear probing the standard insertion algorithm maintains 
average-optimality, other hashing schemes are to be preferred, since the expected 
retrieval cost in the average-optimal scheme for a primary-clustering hashing function 
generally exceeds the expected cost for other schemes, even if average-optimality is not 
maintained. 

We now turn our attention to the task of finding an insertion algorithm that will 
maintain the optimality of an arrangement. In essence, we need an algorithm to solve 
the assignment problem "incrementally." 

One approach is to observe that if N/M is small enough (how small this is we shall 
determine), then the number of keys already in the table which we need to consider 
moving might be reasonably small. Brent considers moving only those keys on the 
probe sequence of the new key K; if we also consider moving all of the keys on their 
probe sequences, and so on, we can determine the maximum set 𝒮 of keys that might 
need to be moved. Similarly we let 𝒯 denote the set of locations that 𝒮 might occupy in 
the optimized table; it suffices then to solve the assignment problem for placing 𝒮 into 
𝒯, rather than X ∪ {K} into T. 
Define, for a given arrangement a, the functions: 

π(K) = min{j | T_{h(K, j)} = empty}, 

σ(K) = {K_i | a(K_i) = h(K, j) for some j < π(K)}, 

τ(K) = {i | h(K, j) = i for some j ≤ π(K)}. 

Then 

𝒮(K) = {K} ∪ ⋃{𝒮(K_i) | K_i ∈ σ(K)}, 

𝒯(K) = τ(K) ∪ ⋃{𝒯(K_i) | K_i ∈ σ(K)} 

define by means of their minimal solutions the sets 𝒮 and 𝒯 of keys and positions 
relevant to the insertion of K into an arrangement a. 

Let β = N/M denote the "loading factor" of the existing arrangement a. In order to 
estimate the expected size of 𝒮(K), we assume that the hashing function is uniform in the 
sense that every permutation of {1, ..., M} is equally likely to be a probe sequence of 
some key K. We can then use the approximation Prob(π(K) = i) = (1 - β)β^{i-1}. 

Let s_i denote the probability that |𝒮(K)| = i, and let 

\[ S(z) = \sum_{i=1}^{\infty} s_i z^i \]

denote the corresponding generating function. We shall develop an equation for S(z) 
which depends on the generating function: 

\[ P(z) = \sum_{i=1}^{\infty} p_i z^i \]

(where p_i is the probability that, for a key K' already stored in T, a(K') = h(K', i)). 
However, determining P(z) for optimized hash tables remains an open problem, so we 
shall approximate S(z) after we develop the correct defining equation. 

Let C(z) = Σ_i c_i z^i be the generating function with coefficients c_i equal to the 
probability that the "contribution" of a key K' on the probe sequence of the new key K 
to 𝒮(K) is i keys. Therefore 

\[ S(z) = \sum_{i=0}^{\infty} (1 - \beta) \beta^i [C(z)]^i \cdot z, \]

since there is a probability of (1 - β)β^i that π(K) = i + 1 (that is, there are i keys on 
the probe sequence for the new key K). The final z is for the key itself. 
Similarly we can define 

\[ C(z) = \left[ \sum_{i=1}^{\infty} p_i (C(z))^{i-1} \right] \cdot \left[ \sum_{i=0}^{\infty} (1 - \beta) \beta^i (C(z))^i \right] \cdot z \]

or equivalently, 

\[ (1 - \beta C(z)) \cdot (C(z))^2 = (1 - \beta) P(C(z)) z. \]

The first term accumulates the contributions of those keys K'' on the probe sequence 
of a key K' on the probe sequence for K, such that K'' occurs before K' in the probe 
sequence for K'. The second term adjusts for those keys K'' occurring after K' in the 
probe sequence for K'. Finally, the third term z is for the key K' itself. 
The expected size of 𝒮(K) is S'(1); and 

\[ S'(z) = \frac{d}{dz} \left( \frac{(1 - \beta) z}{1 - \beta C(z)} \right) = \frac{(1 - \beta C(z))(1 - \beta) + (1 - \beta) z \beta C'(z)}{(1 - \beta C(z))^2} \]

so that 

\[ S'(1) = 1 + \frac{\beta C'(1)}{1 - \beta}. \]

Now 

\[ (1 - \beta C(z)) \, 2 C(z) C'(z) - \beta C'(z) (C(z))^2 = (1 - \beta) [P'(C(z)) C'(z) z + P(C(z))] \]

so we obtain 

\[ C'(1) = \frac{1 - \beta}{2 - 3\beta - (1 - \beta) P'(1)} \]

and thus 

\[ S'(1) = 1 + \frac{\beta}{2 - 3\beta - (1 - \beta) P'(1)}. \]
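The passage from the differentiated equation to the closed forms can be spelled out by setting z = 1. The steps below write β for the loading factor N/M and assume C(1) = P(1) = 1, which holds if both are generating functions of proper probability distributions; that assumption is not stated explicitly in the text.

```latex
\begin{align*}
(1-\beta)\,2\,C'(1) - \beta\,C'(1) &= (1-\beta)\,\bigl[P'(1)\,C'(1) + 1\bigr] \\
C'(1)\,\bigl[\,2 - 3\beta - (1-\beta)\,P'(1)\,\bigr] &= 1-\beta \\
C'(1) &= \frac{1-\beta}{2 - 3\beta - (1-\beta)\,P'(1)} \\
S'(1) = 1 + \frac{\beta\,C'(1)}{1-\beta} &= 1 + \frac{\beta}{2 - 3\beta - (1-\beta)\,P'(1)}
\end{align*}
```

The expected number of relevant keys therefore stays finite exactly as long as the denominator 2 - 3β - (1 - β)P'(1) remains positive.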

Unfortunately, P(z) is unknown. We observe, however, that S'(1) can be expected to 
remain finite as long as P'(1) < (2 - 3β)/(1 - β). Since P'(1) is the expected number 
of probes required to retrieve a key from an optimized table, it is bounded above by 
the expected number of probes required to retrieve a key from a table organized with 
any open-addressing hashing method. For uniform probing (all probe sequences 
equally likely) we have [6] 

\[ P'(1) = \frac{1}{\beta} \log\left( \frac{1}{1 - \beta} \right) \]

approximately. Substituting this into the final equation for S'(1) yields Figure 3; we see 
that the size of the relevant assignment problem is reasonably small (say 10 keys or 
less) as long as β < 0.4 roughly. The function S'(1) has a pole at β = 0.41466541; for 
loading densities less than this we can expect the number of relevant keys to be finite. 
In practice we should expect to be able to handle even higher loading densities without 
much trouble, since our formulas for S, C, and P explicitly ignore the probability of 
overlapping probe sequences. Furthermore, replacing P(z) by its correct definition 
(rather than the one for uniform probing) should yield a definite improvement. 
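The numbers quoted here can be checked numerically. The Python sketch below evaluates the final expression for S'(1) with the uniform-probing approximation for P'(1), taking the logarithm as natural, and locates the pole by bisection; the function names, bracketing interval, and tolerance are illustrative assumptions.

```python
import math

def p_prime_uniform(beta):
    """Uniform-probing approximation: P'(1) = (1/beta) ln(1/(1 - beta))."""
    return math.log(1.0 / (1.0 - beta)) / beta

def denominator(beta):
    """2 - 3*beta - (1 - beta) * P'(1); S'(1) is finite while this is > 0."""
    return 2.0 - 3.0 * beta - (1.0 - beta) * p_prime_uniform(beta)

def expected_relevant_keys(beta):
    """S'(1) = 1 + beta / denominator(beta), valid below the pole."""
    return 1.0 + beta / denominator(beta)

def pole(lo=0.01, hi=0.99, tol=1e-10):
    """Loading factor at which the denominator changes sign (bisection)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if denominator(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for beta in (0.1, 0.2, 0.3, 0.39):
    print(beta, round(expected_relevant_keys(beta), 2))
print(round(pole(), 8))
```

The estimate grows slowly at low densities and then rises sharply as the loading factor approaches the pole.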

The result of this rather complicated analysis is that if the loading density of the file 
is less than roughly 0.4 we can hope to insert a new key K into the table by solving a 
small assignment problem. For higher densities the problem is inherently a global one 
apparently; we must consider for relocation a considerable number of keys. 

5. Discussion and Conclusions 

In this paper we have shown how to arrange a set of keys in a hash table so as to 
minimize the expected (or worst-case) number of probes required to retrieve a key. 
Our analysis demonstrates that the worst-case cost can be reduced to O(log_2(M)) in 
almost all cases. (In practice it should be possible to achieve O(log_2(M)) in all cases with 
very little work, since a set of keys which has an optimized cost that is too large can, by 
choosing another hash function randomly, be expected to yield an O(log_2(M)) cost.) 

Our analysis assumes that uniform hashing is used, however; an open problem is to 
confirm this result for the more common techniques such as double hashing. 

We have also examined briefly a technique for inserting a new key into an optimized 
table so as to maintain optimality of the arrangement. Our result here is that as long as 
the loading factor is less than 0.41 (approximately), we can usually insert a new key 
and maintain optimality by solving a small (approximately 10-element) assignment 
problem. For tables of higher density one must apparently solve an assignment problem 
which involves most of the keys previously stored. (By saving the primal and dual 
variables of the previous solution, one can significantly speed up the solution of the 
new problem, but the extra storage required might better be used to store the keys 
themselves, thereby reducing the overall density.) 

The reader is encouraged to consult the excellent article by Gonnet and Munro [5], 
which gives explicit listings of algorithms for optimizing the arrangement of keys in a 
hash table and tight results on the expected number of probes required to retrieve a 
key from an average-optimal table. 

The techniques described here should be most useful when the hash table is relatively 
static, with the number of retrievals considerably exceeding the number of insertions. 
Large databases are often of exactly this nature, and frequently utilize hashing 
techniques. 

Abstract: We address the problem of code optimization for 
embedded DSP microprocessors. Such processors (e.g., those in 
the TMS320 series) have highly irregular datapaths, and conventional 
code generation methods typically result in inefficient code. 
In this paper we formulate and solve some optimization problems 
that arise in code generation for processors with irregular datapaths. 
In addition to instruction scheduling and register allocation, 
we also formulate the accumulator spilling and mode selection 
problems that arise in DSP microprocessors. We present optimal 
and heuristic algorithms that determine an instruction schedule 
simultaneously optimizing accumulator spilling and mode selection. 
Experimental results are presented. 

Keywords: code generation, optimization, digital signal processors 

I. Introduction 

An increasingly common micro-architecture for embedded systems 
is to integrate a microprocessor or microcontroller, a ROM and an 
ASIC all on a single IC. Such a micro-architecture can currently be 
found in many diverse embedded systems, e.g., FAX modems, laser 
printers, and cellular telephones. 

The programmable component in embedded systems can be an 
application-specific instruction processor (ASIP), a general-purpose 
microprocessor such as the Sparc, a microcontroller such as the Intel 
8051, or a digital signal processing (DSP) microprocessor such as the 
TMS320C25. This paper focuses on the DSP application domain, 
where embedded systems are increasingly used. Many of these systems 
use processors from the TMS320C2x, 56K or ADSP families, 
all fixed-point DSP microprocessors with irregular datapaths. 

As the complexity of embedded systems grows, the need to decrease 
development costs and time to market mandates the use of 
high-level languages (HLLs) in programming DSP processors; only 
short, time-critical portions of the program can be assembly-coded. 
Recent statistics from Dataquest support this trend: HLLs such as C 
(and C++) are gradually replacing assembly 
language, because using HLLs greatly lowers the cost of development 
and maintenance of embedded systems. However, current compilers 
for fixed-point DSP microprocessors generate poor code; thus 
programming in an HLL can incur significant code performance and code 
size penalties. 

While optimizing compilers have proved effective for RISC processors, 
the irregular datapaths and small number of registers found 
in DSP processors remain a challenge to compilers. The direct application 
of conventional code optimization methods (e.g., [1]) has, so 
far, been unable to generate code that efficiently uses the features of 
fixed-point DSP microprocessors. 

Code size matters a great deal in embedded systems since program 
code resides in on-chip ROM, the size of which directly translates into 
silicon area and cost. Designers often devote a significant amount of time to 
reduce code size so that the code will fit into available ROM; exceeding 
on-chip ROM size could require expensive redesign of the entire IC 
[6]. As a result, a compiler that automatically generates small, dense 
code will result in a significant productivity gain as well. 

There has been relatively little previous work in the area of code 
generation for DSP processors. Cheng and Lin present methods for 
code generation for the TMS320C40 in [2]. The algorithms they use 
are similar to high-level scheduling and allocation methods. Various 
groups in the hardware design community have recently started working 
on the problem of retargetable code generation for embedded processors 
[9]; the focus, however, has been on horizontally microcoded 
architectures. 

In this paper, we develop code optimization techniques for DSP microprocessors 
with irregular datapaths that improve code performance 
and reduce code size. Our techniques are applicable to a broad class 
of DSP microprocessors, including those in the TMS320, DSP56K, 
and ADSP series. The optimization problems we target include instruction 
scheduling and register allocation. We also formulate the 
accumulator spilling and mode selection problems that arise in DSP 
microprocessors. The individual problems of scheduling and register 
allocation have traditionally been solved independently in compilers. 
We present optimal and heuristic algorithms that determine an instruction 
schedule while simultaneously optimizing accumulator spilling 
and mode selection. The experimental results we have obtained show 
significant improvements over existing code generation methods. 

The paper is organized as follows. We use the TMS320C25 to 
illustrate our architecture model in Section II. In Section III we formulate 
the mode selection problem. We describe the general register 
allocation problem in Section IV and go on to formulate the accumulator 
spilling problem that is specific to DSP microprocessors. A 
branch-and-bound scheduling algorithm that produces a schedule for 
a basic block with a minimal number of accumulator spills and mode 
switches is presented in Section V. Experimental results are presented 
in Section VI. We conclude in Section VII with our ongoing research. 

II. Example: TMS320C25 

We describe Texas Instruments' popular TMS320C25 DSP microprocessor
[7], highlighting the key features of this irregular architecture
that are not addressed in traditional compiler optimizations. Fig. 1
shows a simplified model of its datapath.

The TMS320C25 is an accumulator-based machine. In addition to 
the usual ALU, there is a separate multiplier which takes input from 
the T register and memory and places the result in the P register. With 
the separate multiplier the machine can execute one-cycle multiply- 
accumulate operations. Note that there are no general-purpose regis- 
ters other than the accumulator. Most operations involve an operand 
taken from the memory. 




Fig. 1. TMS320C25 datapath (simplified model)

The memory is addressed by the address register file (AR0 through
AR7), which is in turn addressed by the 3-bit address register pointer 
(ARP) (denoting the "current" AR). The address generation unit (AGU) 
allows the current AR to be auto-incremented or auto-decremented 
during the execution of any TMS320C25 instruction that uses indirect 
addressing mode. 

Another feature of the TMS320C25 that is usually missing in gen- 
eral register machines is the use of modes (or residual control in 
microprogramming terminology). The most commonly used mode 
classes are sign-extension and product-shift. Some instructions are 
affected by the setting of the modes, and if the current mode setting is 
different from that desired by the instruction, then it must be set to the 
appropriate values. 

The use of address registers and modes is likely intended to allow 
for more compact code. However, this means that the compiler's job 
is made more difficult. In the subsequent sections we will examine the 
problem of mode settings and present a formulation for the generalized 
problem. We will then extend this framework for the problem of 
minimizing the spills of the accumulator. 

III. Mode Optimization Problem 

The goal of mode optimization is to schedule the instructions so
that the number of mode-setting instructions is minimized.

A. Simple Mode Optimization 

Let the DAG $G = (V, E)$ for a basic block be given. Let $r$ be the
number of modes, and let $l : V \to \{1, \ldots, r\}$ label each node
$v \in V$ with a mode $l(v)$. Let $C = [c_{ij}]$ be the cost matrix,
where $c_{ij} \ge 0$ is the cost of switching from mode $i$ to mode $j$.
We assume that the following hold for the cost matrix $C$:



$$c_{ii} = 0, \quad \text{for all } i \tag{1}$$

$$c_{ik} \le c_{ij} + c_{jk}, \quad \text{for all } i, j, k \tag{2}$$



Inequality (2) (triangular inequality) will be used later to establish 
lower bounds in the branch-and-bound algorithm. The simple mode 
optimization problem (SMOPT) is the problem of finding a linear 
schedule $S$ that is a topological sort of $G$, $(v_1, v_2, \ldots, v_n)$, such that



$$\mathrm{mode\_cost}(S) = \sum_{i=1}^{n-1} c_{l(v_i)\, l(v_{i+1})} \tag{3}$$



is minimized. 

Theorem 1: The decision problem for simple mode optimization 
is NP-complete. 
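
The objective of Eqn. (3) can be evaluated for any candidate schedule with a few lines of code. The following is a minimal sketch, not taken from the paper; the function name `mode_cost`, the two-mode labeling, and the unit cost matrix are all illustrative assumptions.

```python
def mode_cost(schedule, label, cost):
    """Total switching cost between consecutive nodes of a linear schedule."""
    return sum(cost[label[u]][label[v]] for u, v in zip(schedule, schedule[1:]))


# Hypothetical data: two modes (0 and 1) with unit switching cost.
unit_cost = [[0, 1], [1, 0]]
label = {"A": 0, "B": 1, "C": 0, "D": 1}

alternating = mode_cost(["A", "B", "C", "D"], label, unit_cost)  # modes 0,1,0,1
grouped = mode_cost(["A", "C", "B", "D"], label, unit_cost)      # modes 0,0,1,1
```

Grouping nodes of equal mode lowers the cost, which is the effect SMOPT tries to achieve subject to the precedence constraints of the DAG.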

B. Multiple Mode Classes and Don't-Cares

In Section III-A we only considered a single mode class. A mode 
class is a set of mutually exclusive modes, and at any point the machine 
can only be in exactly one of the modes. For example, sign extension 
and product shift are two mode classes in the TMS320C25. We assume 
that each mode class has a cost matrix whose values are independent 
of every other mode class. In the absence of don't-cares, multiple
mode classes are equivalent to a single mode class: their Cartesian 
product. 

Let us first consider the presence of don't-cares (denoted by $-$)
in a single mode class. A node is labeled $-$ if it is not affected by
the current mode. Hence, we can first disregard these nodes, try to
optimally schedule the other nodes, and then put these nodes back
in the schedule, consistently with the original DAG. To do so, we
construct from $G$ a reduced DAG $G' = (V', E')$ as follows:

Procedure: Don't-Care Reduction

1. For each path $v_1, v_2, \ldots, v_{k-1}, v_k$ in $G$ where $l(v_j) = -$ for
$2 \le j \le k-1$, and $l(v_1), l(v_k) \ne -$, we add an edge $(v_1, v_k)$.
This preserves the precedence relationship from $G$.

2. Remove each node $v \in V$ such that $l(v) = -$, and all edges
incident on or emanating from $v$.
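
The two steps of the procedure can be combined into one traversal: from every labeled node, walk forward through don't-care nodes and connect to the first labeled node reached on each path. The sketch below is one possible realization under that reading; the encoding of the don't-care label as the string "-" and the small example graph are assumptions made for illustration.

```python
DC = "-"  # don't-care label (assumed encoding)


def dont_care_reduce(succ, label):
    """Return the successor map of the reduced DAG.

    succ:  node -> list of successors in G
    label: node -> mode, or DC when the node is unaffected by the mode
    """
    reduced = {v: set() for v in succ if label[v] != DC}
    for v in reduced:
        stack, seen = list(succ[v]), set()
        while stack:
            w = stack.pop()
            if w in seen:
                continue
            seen.add(w)
            if label[w] == DC:
                stack.extend(succ[w])   # keep walking through don't-cares
            else:
                reduced[v].add(w)       # first labeled node on this path
    return reduced


# Hypothetical DAG: A -> X -> B, A -> C, X -> Y -> C, with X and Y don't-cares.
succ = {"A": ["X", "C"], "X": ["B", "Y"], "Y": ["C"], "B": [], "C": []}
label = {"A": 0, "X": DC, "Y": DC, "B": 1, "C": 0}
reduced = dont_care_reduce(succ, label)
```

In this example the reduced DAG keeps only A, B, and C, with edges from A to B and from A to C.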

Now let us consider m mode classes. The label of each node is 
an m-tuple, some or all of whose components may be $-$. For each
mode class p we can construct a reduced DAG from G by projecting 
the label to the corresponding component and then reducing the graph 
using the above procedure; this derived graph is called the p-reduced 
DAG. We may then schedule these reduced DAGs separately. Each of 
these schedules is called a reduced schedule. 

Definition 1: Let $S_p$ and $S_q$ be valid reduced schedules for the $p$-
and $q$-reduced DAGs. $S_p$ and $S_q$ are said to be compatible if they can
be merged to form a schedule that is valid with respect to $G$.

For example, the reduced schedules ABEG and AEFG are compat- 
ible; whereas ABEG and AGDE are not, since they have conflicting 
orders for nodes E and G. We state without proof the following 
proposition. 
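
Compatibility can be tested mechanically: two reduced schedules can be merged exactly when the union of their ordering constraints, together with the edges of $G$, contains no cycle. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9 or later). The function name `merge_if_compatible` is an assumption, and since the text does not list the DAG edges for this example, they are left as an optional argument.

```python
from graphlib import CycleError, TopologicalSorter


def merge_if_compatible(s_p, s_q, dag_edges=()):
    """Return a merged schedule, or None if the two orders conflict."""
    ts = TopologicalSorter()
    for sched in (s_p, s_q):
        for v in sched:
            ts.add(v)
        for u, v in zip(sched, sched[1:]):
            ts.add(v, u)                # u must precede v
    for u, v in dag_edges:
        ts.add(v, u)                    # precedence edges of G, if supplied
    try:
        return list(ts.static_order())
    except CycleError:
        return None


ok = merge_if_compatible("ABEG", "AEFG")
conflict = merge_if_compatible("ABEG", "AGDE")
```

For the pairs quoted above, the first call yields the merged order A, B, E, F, G, while the second returns `None` because E and G are ordered inconsistently.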

Proposition 1: Let $S_p$ be an optimal reduced schedule for the $p$-reduced
DAG, $p = 1, \ldots, m$. If all $S_p$'s are compatible, then they
can be merged to form an optimal schedule for $G$.

Unfortunately, not all optimal reduced schedules are compatible. 
Thus, if reduced DAGs are employed in a heuristic algorithm, some 
non-optimal reduced schedules will have to be chosen in order to 
achieve compatibility. In a branch-and-bound algorithm, we consider
all mode classes simultaneously and compute the lower bound by
resolving the don't-cares conservatively.

C. Example 

Let us consider the expression DAG G shown in Fig. 2(a). There 
are two mode classes, $p = \{u, s\}$ and $q = \{0, 1, 2\}$. The corresponding
reduced DAGs are shown in Fig. 2(b) and (c). 

Assuming that any mode change incurs a unit cost, there are two
optimal reduced schedules for the $p$-reduced DAG: $S_{p1}$ = ACBEF
and $S_{p2}$ = BACFE. There is only one optimal reduced schedule for
the $q$-reduced DAG: $S_q$ = ADFHEG. $S_{p1}$ and $S_q$ are not compatible,
since the former schedules F after E whereas the latter schedules F
before E. However, $S_{p2}$ and $S_q$ are compatible, and merging them
yields the optimal schedule for $G$: BACDFHEG.




Fig. 2. (a) An expression DAG G. (b) p-reduced DAG. (c) q-reduced
DAG.

IV. Register Allocation and Accumulator Spilling

This section describes the effect of the scheduling step on the 
number of values that have to be stored to perform a given computation. 
We first focus on conventional architectures and then switch focus to 
irregular datapaths such as the TMS320C25. 

A. RISC Architectures 

Register allocation deals with allocating variables in the given 
basic block to a minimum number of registers. If at any time we 
have to store a set of values whose cardinality exceeds the number 
of available registers, spilling into memory is the only alternative. 
The register allocation problem for RISC machines is quite different 
from machines such as the TMS320C25 since the latter does not have 
a register file. However, in both cases it is possible to arrive at a 
cost function for scheduling that simplifies the succeeding register 
allocation step. 

B. Effect of Scheduling on Register Allocation 

It is well known that different schedules corresponding to a
data-dependency graph require differing numbers of registers. We
formalize this effect in the sequel.

We are given a basic block represented as a DAG $G = (V, E)$.
We will assume a sequential model of processor execution; however,
the discussion can easily be generalized to include parallelism
corresponding to multiple execution threads. Each node $v \in V$ is assumed
to have two input variables $i_1(v)$ and $i_2(v)$ and an output variable
$o(v)$. We further assume the graph is in static single assignment form,
that is, every assignment is to a unique variable.

If we schedule the nodes in G, then the life-times of all the variables
can be calculated. The life-time of a variable is the duration between 
its unique assignment and last use. Since most processors allow the 
reading and writing of values into a register in a single instruction, if 
a variable is written in instruction i and is read in instruction j > i, 
we will denote its life-time as the (open-ended) interval [i, j). 



        (a)                 (b)                 (c)

    v1 = v2 + v3        R1 = R2 + R3        R1 = R2 + R3
    v4 = v2 - v3        R4 = R2 - R3        R1 = R1 * R2
    v5 = v1 * v2        R1 = R1 * R2        R2 = R2 - R3
    v6 = v4 & v3        R4 = R4 & R3        R2 = R2 & R3
    v7 = v5 | v6        R4 = R1 | R4        R3 = R1 | R2

Fig. 3. (a) Code sequence (b) Register allocation on original code
sequence (c) Register allocation on reordered code sequence

Variables with non-overlapping life-times can be merged into the
same register. For example, the variable $i_1(v)$ with life-time $[i, j)$ can
be merged with variable $i_2(v)$ with life-time $[k, l)$ if $k \ge j$.

The number of registers required is proportional to the overlap of 
the live periods of the variables, or to put it differently, the number of 
registers required is the maximal density of variable life-times across 
the entire sequence. Given a set of variable life-times in a schedule 
we can compute the maximal density in linear time. 

The simple register allocation problem is to find the best possible 
grouping of variables with non-overlapping life-times into a minimum 
number of sets. Given a fixed schedule the simple register allocation 
problem (SRAOPT) can be solved in polynomial time. (A polynomial- 
time solution is only possible when static single assignment for each 
variable is assumed, in which case the interference graph is an interval 
graph, which can be colored in polynomial time [3].) However, there 
is freedom in the ordering of the nodes of the given DAG as long as
the dependency constraints are not violated. Given a code sequence, 
exploiting this freedom can result in a smaller set of registers being 
required. This is illustrated in Fig. 3. In Fig. 3(a), an example code 
sequence being executed on a processor with a single arithmetic unit 
is shown. Without changing the order of the operations in the code 
sequence, the minimum number of registers required is 4, as shown 
in Fig. 3(b). Allowing re-ordering of operations within the sequence 
produces a 3-register solution in Fig. 3(c). (A similar example was
given in [10].) 
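
The register counts quoted for Fig. 3 can be checked by computing the maximal density of life-times for each ordering. The sketch below does this in linear time with a difference array over instruction indices. It is an illustration only: the function name `max_density`, the tuple encoding of instructions, and the treatment of v2 and v3 as live on entry and v7 as live on exit are assumptions read off the figure.

```python
def max_density(schedule, live_in=(), live_out=()):
    """Maximal number of simultaneously live variables in a schedule.

    schedule: list of (dest, src1, src2) tuples in execution order.
    A variable written at instruction i and last read at j has life-time [i, j).
    """
    n = len(schedule)
    start = {v: 0 for v in live_in}      # live-in values exist before instruction 1
    end = {}
    for t, (dest, *srcs) in enumerate(schedule, start=1):
        start[dest] = t                  # unique assignment (SSA)
        for s in srcs:
            end[s] = t                   # overwritten until only the last use remains
    for v in live_out:
        end[v] = n + 1                   # kept alive past the last instruction
    delta = [0] * (n + 2)
    for v, s in start.items():
        e = end.get(v, s + 1)            # assumption: an unused value occupies one slot
        delta[s] += 1
        delta[e] -= 1
    density = peak = 0
    for d in delta:                      # prefix sum gives density at each point
        density += d
        peak = max(peak, density)
    return peak


original = [("v1", "v2", "v3"), ("v4", "v2", "v3"), ("v5", "v1", "v2"),
            ("v6", "v4", "v3"), ("v7", "v5", "v6")]
reordered = [original[0], original[2], original[1], original[3], original[4]]

regs_original = max_density(original, live_in=("v2", "v3"), live_out=("v7",))
regs_reordered = max_density(reordered, live_in=("v2", "v3"), live_out=("v7",))
```

Under these assumptions the original order needs 4 registers and the reordered one needs 3, in agreement with Fig. 3(b) and (c).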

Finding the optimal ordering of operations within a sequence, so
as to allocate a minimum set of registers, reduces to a one-dimensional
linear arrangement problem similar to SMOPT of Section III. The
register allocation problem (RAOPT) involves finding a schedule $S$
that is a topological sort of a basic block represented by a DAG $G$,
$(v_1, v_2, \ldots, v_n)$, such that

$$\mathrm{reg\_cost}(S) = \max_i \, \mathrm{density}(i) \tag{4}$$

is minimized, where density($i$) corresponds to the number of variables
that are live at instruction $i$.
C. DSP Microprocessor Architectures

Register allocation for processors with a general-purpose register 
file is relatively straightforward. Obtaining a schedule that mini- 
mizes the maximal density of variable life-times will result in minimal 
spilling into memory. 

For processors such as the TMS320C25, the register allocation 
problem is more complicated for several reasons, the foremost of
which is the indirect addressing mechanism that is used in the datapath. 

• The TMS320C25 has a single accumulator and no general- 
purpose registers. 

• The address registers AR0 through AR7 are in turn addressed by
the register ARP. 

• The address registers AR0 through AR7 can be auto-incremented
and auto-decremented during the execution of any TMS320C25 
instruction that uses indirect addressing mode. However, loading 
a new address requires a separate instruction. 

• The ARP can be switched to point to a different address register 
during any instruction using indirect addressing mode. 

All of the above complications can be taken into account by formu- 
lating the register allocation problem for a fixed code schedule as 
an offset assignment problem [8]. In this paper, we will focus on 
the first item above, and formulate the minimal spilling problem for 
accumulator-based architectures. 

In the TMS320C25, instructions such as ADD, MAC and LAC (load- 
accumulator) write into the accumulator. The MPY instruction writes 
into the P register. Given a schedule it is easy to determine the number 
of accumulator spills using life-time analysis. If the accumulator 
value is used in the immediately following instruction and nowhere 
else, then spilling into memory is not required, else we need to spill 
the contents into memory for later access. Different schedules will 
result in different numbers of accumulator spills. 
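
The spill rule stated above is easy to apply to a given schedule. The following sketch counts spills under that rule; it is illustrative only. The function name, the `consumers` map, and the example graph are assumptions, and a value that is never used and not live on exit is assumed to need no spill.

```python
def count_acc_spills(schedule, consumers, writes_acc, live_out=()):
    """Count accumulator values that must be stored to memory.

    A value avoids a spill only if its sole use is the next instruction.
    """
    spills = 0
    for pos, v in enumerate(schedule):
        if v not in writes_acc:
            continue
        nxt = schedule[pos + 1] if pos + 1 < len(schedule) else None
        uses = set(consumers.get(v, ()))
        if v in live_out or (uses - {nxt}):
            spills += 1
    return spills


# Hypothetical block: a -> b, c -> d, and e consumes b and d; e is live on exit.
consumers = {"a": ["b"], "c": ["d"], "b": ["e"], "d": ["e"]}
acc_nodes = {"a", "b", "c", "d", "e"}

interleaved = count_acc_spills(["a", "c", "b", "d", "e"], consumers, acc_nodes, {"e"})
chained = count_acc_spills(["a", "b", "c", "d", "e"], consumers, acc_nodes, {"e"})
```

Keeping each producer directly in front of its consumer reduces the count from 4 to 2 in this example.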

The accumulator spilling problem (ASOPT) involves finding a
linear schedule $S$ that is a topological sort of a basic block represented
by a DAG $G$, $(v_1, v_2, \ldots, v_n)$, such that

$$\mathrm{spill\_cost}(S) = \text{number of accumulator spills in } S \tag{5}$$

is minimized. 

V. A Branch-and-Bound Algorithm for Scheduling 

In this section we present a branch-and-bound algorithm which, 
given a basic block represented as a DAG, determines an optimal code 
schedule under a specified cost function. 

A. Cost Function 

The cost we use includes the number of accumulator spills required 
by the schedule S (see Eqn. (5)) as well as the number of mode switches 
required (see Eqn. (3)). 

The cost function we use in the branch-and-bound method is: 

$$C(S) = W_S \times \mathrm{spill\_cost}(S) + W_M \times \mathrm{mode\_cost}(S)$$

where $W_S$ and $W_M$ depend on the relative cost of instructions
required to spill the accumulator to memory and instructions required to
accomplish a mode switch.

B. Branching Search 

Branching over all possible solutions is accomplished using the
recursive branching strategy of Fig. 4. Initially, find-optimal-schedule()
is called with the original DAG $G = (V, E)$ and $P = \emptyset$ as
parameters. It returns the best schedule $S$ and the cost of the schedule
$C(S)$.

    find-optimal-schedule( G, P ):
    {
        /* G = DAG of basic block */
        /* P = Current partial schedule for DAG */
        /* C(P) = Cost of partial schedule */
        /* S, C(S) = Best complete schedule seen thus far and its cost */
        N = find-scheduleable-nodes( G, P );
        if ( N = ∅ ) {
            if ( C(P) < C(S) )
                [ S, C(S) ] = [ P, C(P) ];
            return [ S, C(S) ];
        }
        foreach node v in N {
            LB = lower-bound( G - v, P ∪ v );
            if ( C(P ∪ v) + LB < C(S) ) {
                [ T, C(T) ] = find-optimal-schedule( G - v, P ∪ v );
                if ( C(T) < C(S) )
                    [ S, C(S) ] = [ T, C(T) ];
            }
        }
        return [ S, C(S) ];
    }

Fig. 4. Branch-and-bound procedure to determine optimal schedule

The procedure find-scheduleable-nodes() determines the set of
nodes $N \subseteq V$ that can be scheduled given the partial schedule $P$
such that dependency constraints are not violated. If there are no 
scheduleable nodes it means that the schedule is complete. If the cost 
of this complete schedule is less than the best cost seen thus far, we 
save the complete schedule P as the best schedule and return to the 
previous level of recursion. 

If there are scheduleable nodes in $N$, we select each of them in
sequence and recursively call find-optimal-schedule(). Once we have
chosen a node $v$ to add to $P$, we first compute a lower bound on the
cost of any schedule we will see in this recursion path. If the cost
of the partial schedule $P \cup v$ plus the computed lower bound on the
unscheduled DAG $G - v$ is greater than or equal to the best cost
seen thus far, there is no need to explore this recursion path. The
best schedule with its associated cost is returned by procedure
find-optimal-schedule().
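
The search of Fig. 4 can be sketched in executable form as follows. This is a simplified illustration rather than the authors' implementation: the DAG is assumed to be given as a map from each node to its set of predecessors, the cost is accumulated through a caller-supplied `step_cost(partial, v)`, and the default lower bound is the trivial bound 0. The example uses a unit mode-switch cost only.

```python
import math


def find_optimal_schedule(preds, step_cost, lower_bound=lambda remaining, partial: 0):
    """Branch-and-bound over all topological orders of a DAG.

    preds: node -> set of predecessor nodes.
    step_cost(partial, v): cost added by appending v to the tuple `partial`.
    lower_bound(remaining, partial): bound on the cost still to be incurred.
    """
    best = {"cost": math.inf, "sched": None}
    nodes = list(preds)

    def search(partial, cost, done):
        ready = [v for v in nodes if v not in done and preds[v] <= done]
        if not ready:                               # schedule is complete
            if cost < best["cost"]:
                best["cost"], best["sched"] = cost, partial
            return
        for v in ready:
            new_cost = cost + step_cost(partial, v)
            remaining = [u for u in nodes if u not in done and u != v]
            # Prune unless this branch can still beat the best cost seen so far.
            if new_cost + lower_bound(remaining, partial + (v,)) < best["cost"]:
                search(partial + (v,), new_cost, done | {v})

    search((), 0, frozenset())
    return best["sched"], best["cost"]


# Hypothetical block: A -> C and B -> D; A, C use mode 0 and B, D use mode 1.
preds = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}}
mode = {"A": 0, "B": 1, "C": 0, "D": 1}


def unit_switch(partial, v):
    return 0 if not partial or mode[partial[-1]] == mode[v] else 1


schedule, cost = find_optimal_schedule(preds, unit_switch)
```

With the trivial bound the pruning test only compares the partial cost with the incumbent; supplying tighter bounds, such as those of Section V-C, prunes more of the search.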

C. Lower Bound Computation 

The procedure lower-bound() is critical to improving the efficiency
of the search. If we can compute tight lower bounds, then we can prune 
the search considerably by reducing the depth of recursion. 

Given a DAG $G = (V, E)$ we need to compute a lower bound over
all possible schedules $P$ consistent with $G$. This entails computing a
lower bound for mode_cost($P$) and a lower bound for spill_cost($P$).

C.1 Lower Bound for Spill Cost

We assume that the accumulator has no useful value upon entry to 
the basic block in Eqn. (5) as well as in the lower bound computation 
below. 

We mark the nodes in the given DAG G using the steps below. 
Initially all nodes are unmarked. 

1. Nodes whose outputs are outputs of the basic block and which
write into the accumulator (e.g., MAC and ADD) will spill their 
contents, and these nodes are marked. 



2. If a node $v$ in $G$ has two or more fanouts, then $o(v)$
is used as an input in at least two other instructions, which implies
that the accumulator contents corresponding to $o(v)$ have to be spilled
into memory. Such nodes are marked.

3. If node $v$ receives inputs from nodes $x$ and $y$ which correspond
to instructions that write into the accumulator, then either $o(x) = i_1(v)$
or $o(y) = i_2(v)$ has to be spilled. Therefore, if both $x$ and
$y$ are unmarked, we will mark $x$ or we will mark $y$ (but not both).

The number of marked nodes in G corresponds to a lower bound on 
the number of accumulator spills in any schedule consistent with G. 

C.2 Lower Bound for Mode Cost

To estimate a lower bound for the mode cost given an unscheduled
DAG $G$, we simply compute the maximum, over all paths of $G$, of the
mode-switching cost of Eqn. (3) along the path. This follows from
Inequality (2), since any schedule must contain this path as a
subsequence, and the cost of switching from mode $i \to j$ cannot be
greater than that of $i \to k \to j$ for any $k$ (for otherwise we can
replace the former by the latter).
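
The maximum path cost can be found with a single pass over the nodes in topological order. The sketch below assumes that every node carries a definite mode (no don't-cares) and uses Python's standard `graphlib` module (Python 3.9 or later); the function name and the example are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def mode_lower_bound(preds, label, cost):
    """Largest mode-switching cost accumulated along any path of the DAG.

    preds: node -> list of predecessor nodes.
    """
    best = {}                                    # best[v]: costliest path ending at v
    for v in TopologicalSorter(preds).static_order():
        best[v] = max((best[u] + cost[label[u]][label[v]] for u in preds[v]),
                      default=0)
    return max(best.values(), default=0)


# Hypothetical block: chain A -> B -> C with modes 0, 1, 0, plus an isolated D.
preds = {"A": [], "B": ["A"], "C": ["B"], "D": []}
label = {"A": 0, "B": 1, "C": 0, "D": 1}
unit_cost = [[0, 1], [1, 0]]
bound = mode_lower_bound(preds, label, unit_cost)
```

The chain forces two mode switches in any schedule, so the bound is 2 regardless of where D is placed.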

D. Hashing 

The branching procedure of Fig. 4 may perform a significant
amount of redundant computation. Consider a situation where we
have constructed a partial schedule $P_1$ which corresponds to the set
of nodes $V_1$. The optimal scheduling subproblem is then solved for
$G - V_1$ with appropriate initial conditions corresponding to the mode
and accumulator contents of the last instruction in $P_1$. Now, if we
compute a different partial schedule $P_2$ corresponding to the same set
of nodes $V_1$, and the last instruction in $P_2$ is the same as the last
instruction in $P_1$, we solve exactly the same optimal scheduling
subproblem for $G - V_1$. However, there is no mechanism in the procedure
of Fig. 4 to detect that we have solved the subproblem for $G - V_1$ already.

The lower bounding technique alleviates the above inefficiency to 
a certain extent. However, since the bounds may not always be tight, 
significant redundant computation may occur. 

A hashing mechanism of "remembering" previously computed optimal
solutions for parts of the original DAG can greatly improve
the efficiency of the search. We hash each partial schedule $P_i$ such
that $|P_i| \le L$, where $L$ is a user-specified parameter in the range
$2 \le L \le |V|$. The hash is computed in such a way that if different
partial schedules $P_i$ and $P_j$ contain the same set of nodes $V_1$, the
same hash is computed. Once the subproblem of finding an optimal
schedule for $G - V_1$ has been solved for $P_i$, we store the result in a
hash table that is accessible by the hash for $P_i$. Before we begin to
solve the subproblem associated with $P_j$, we check whether $|P_j| \le L$;
if so, we compute the hash for $P_j$ and access the hash table. If we get
a "hit" in the hash table, we immediately return the best solution for
the subproblem of scheduling $G - V_1$. A hit in the hash table implies
that $P_i$ and $P_j$ correspond to the same set of instructions, and further
that the last instructions in $P_i$ and $P_j$ are the same.
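
A simple way to realize such a table is to key it on the pair (set of scheduled nodes, last scheduled node). The sketch below shows the idea in isolation, as plain memoized search without the bounding step of Fig. 4. It assumes that the incremental cost depends only on the last scheduled instruction; the names `optimal_cost_hashed` and `step_cost` and the example are illustrative.

```python
def optimal_cost_hashed(preds, step_cost, L):
    """Minimum schedule cost, caching subproblems with at most L scheduled nodes.

    preds: node -> set of predecessor nodes.
    step_cost(last, v): cost of scheduling v right after `last` (None at the start).
    """
    nodes = frozenset(preds)
    table = {}                    # (scheduled set, last node) -> best completion cost

    def solve(done, last):
        key = (done, last)
        if len(done) <= L and key in table:
            return table[key]                     # "hit": subproblem already solved
        ready = [v for v in nodes - done if preds[v] <= done]
        result = min((step_cost(last, v) + solve(done | {v}, v) for v in ready),
                     default=0)
        if len(done) <= L:
            table[key] = result
        return result

    return solve(frozenset(), None)


# Hypothetical block: A -> C and B -> D; A, C use mode 0 and B, D use mode 1.
preds = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}}
mode = {"A": 0, "B": 1, "C": 0, "D": 1}


def unit_switch(last, v):
    return 0 if last is None or mode[last] == mode[v] else 1


best_cost = optimal_cost_hashed(preds, unit_switch, L=4)
```

Because a frozen set ignores the order in which nodes were scheduled, two partial schedules over the same nodes that end in the same instruction map to the same table entry.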

E. Heuristics 

Once the nodes in $N$ have been determined by the procedure
find-scheduleable-nodes(), we can recursively call
find-optimal-schedule() for each of the nodes in $N$ in any order. However, to
improve efficiency it is worthwhile to first explore partial solutions 
that have a good chance of being extended to optimal solutions. This 
determination can only be made heuristically. 

We sort the nodes in $N$ based on a cost estimate to obtain a sorted
list sort($N$). The cost estimate for any $v \in N$ is equal to
$C(P \cup v) - C(P)$, which includes both the spill cost and the mode
switching cost. Nodes in $N$ are sorted in increasing order of this cost
estimate.

For small to moderate-sized basic blocks the optimal algorithm of 
Fig. 4 that explores all possible solutions is viable. For large basic 
blocks, we have to resort to heuristic search techniques. A fast, greedy 
heuristic is based on the node sorting method described above. We 
only explore solutions corresponding to the first t nodes in the sorted 
list sort(N). The foreach loop of Fig. 4 is replaced with t calls to 
find-optimal-schedule(), where typically $1 \le t \le 3$.
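
The restricted search can be sketched as follows: at each level the scheduleable nodes are sorted by their incremental cost and only the first t are explored. This is an illustration under simplifying assumptions (no lower bounds, a caller-supplied `step_cost`, and a unit mode-switch cost in the example), not the implementation evaluated in Section VI.

```python
import math


def heuristic_schedule(preds, step_cost, t=1):
    """Explore only the t cheapest extensions of each partial schedule."""
    best = {"cost": math.inf, "sched": None}

    def search(partial, cost, done):
        ready = [v for v in preds if v not in done and preds[v] <= done]
        if not ready:
            if cost < best["cost"]:
                best["cost"], best["sched"] = cost, partial
            return
        ready.sort(key=lambda v: step_cost(partial, v))   # cheapest extension first
        for v in ready[:t]:                               # keep only the first t
            search(partial + (v,), cost + step_cost(partial, v), done | {v})

    search((), 0, frozenset())
    return best["sched"], best["cost"]


# Hypothetical block: A -> C and B -> D; A, C use mode 0 and B, D use mode 1.
preds = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}}
mode = {"A": 0, "B": 1, "C": 0, "D": 1}


def unit_switch(partial, v):
    return 0 if not partial or mode[partial[-1]] == mode[v] else 1


greedy_schedule, greedy_cost = heuristic_schedule(preds, unit_switch, t=1)
```

With t = 1 the search degenerates to a single greedy pass; larger t trades run time for a better chance of reaching an optimal schedule.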

VI. Experiments and Results 

We have implemented the heuristic algorithm of Section V-E to 
perform scheduling for minimum cost. Our experiments are based on 
a code generator for a simplified TMS320C25 architecture with all the 
features described in Section II. A full-featured TMS320C25 code 
generator is currently under development (see Section VII). 

To accurately account for the various types of costs, we attribute 
the following cost components to each node. 

• Instruction (I). Each node in the DAG is associated with an 
instruction that has a cost of 1.

• Common subexpression (C). If the node has two or more uses, it is
a common subexpression. Under our assumption of aggressive
common subexpression elimination, the result of this node is
stored to memory rather than being recomputed at a later time.

• Live-on-exit variable (V). The result of the computation for this 
node is live upon exit of this basic block. 

• Load (L). One operand of the node is not in the accumulator; 
therefore, it needs to be loaded. 

• Spill (S). The result of the previously scheduled node will be 
used later but not now. 

• Mode change (M). The node requires a mode setting different 
from the current setting. The mode classes considered are sign- 
extension and product-shift. 

The first three items are fixed costs. Under any schedule, the num- 
ber of instructions, common subexpressions, and live-on-exit variables 
remains the same. Therefore, the only optimizable costs are those of 
loads, spills, and mode changes. 

We present our experimental results in Tables I and II. The former 
gives the original schedule generated by the front-end after common 
subexpression elimination. This schedule closely resembles that found 
in the source code. The latter shows the results we have obtained after 
our heuristic scheduling. 

SpeedCtl is a routine in an ADPCM transcoder applying the CCITT 
recommendation G.721. Compaction is a notch-filter routine. FFTBR
is a Fast Fourier Transform routine with a mechanism to prevent
overflows. These three are taken from the DSPstone benchmark suite 
[11]. ChenDct, ChenIDct, LeeDct, and LeeIDct are discrete cosine
transform routines in a JPEG package. We have chosen the largest 
basic blocks from these routines. The column labeled "ratio (LSM)" 
gives the ratio of only the optimizable costs (loads, spills, and mode 
changes), whereas the "ratio (all)" column gives the ratio with the
fixed costs (I, C, and V) taken into account. These results are very 
encouraging. Even with the simple heuristic we were able to achieve 
substantial improvement over the schedule given by the front-end. 

VII. Conclusions and Ongoing Work 

Code generation for irregular datapaths, such as those used in 
DSP microprocessors, is a problem that has received relatively little 
attention to date. With the advent and increasing use of embedded 
systems, this problem has become very important. In this paper we 
presented scheduling algorithms that are able to exploit the features 
of the TMS320C25 microprocessor. Our initial results indicate that 
these algorithms obtain substantial improvements in code size and 
performance over conventional code generation techniques. 

We are currently developing a framework for retargetable code 
generation [9]. There are many avenues for further work in this 
area. Our framework is directly applicable to traces [4] [5] rather 
than just basic blocks, and experiments on traces will be conducted 
in the near future. Traces will allow for more global optimization 
and afford the possibility of even greater savings over conventional 
optimization. One way to avoid the possible code explosion caused 
by trace scheduling is to restrict the movement across basic blocks to 
mode-setting instructions. This way we ensure that along the most 
frequent traces the number of such instructions is minimized. 

The framework can be easily generalized to accumulator-based 
machines which also have a general-purpose register file such as the 
TMS320C40. This can easily be done by adding Eqn. (4) to the cost 
function. 

Storage assignment [8] is a very important post-scheduling prob- 
lem that has to be solved in order to ensure that memory accesses have 
minimal cost. For machines (such as the TMS320C25) without indexing
addressing mode, variables are accessed through address registers
(the ARs) and it is desirable that the auto-increment/decrement feature
be efficiently utilized. The placement of variables in storage has a
significant impact on the size and performance of the generated code.
For instance, if variables are accessed in the order abacd and the
following assignment is made: a:1, b:2, c:3, and d:4, accessing a
followed by c requires an explicit instruction to increase the AR by
two; every other access can be accomplished using auto-increment or
decrement. On the other hand, if we use the assignment: a:2, b:1, c:3,
and d:4, then all changes in the AR can be done via auto-increment or 
decrement; no explicit instruction for changing the AR is necessary. 
This problem is related to the mode optimization problem in that the 
AR can be considered a mode class and our goal is again to minimize
the number of "mode-setting" instructions. Its relationship with
scheduling is, however, much more complicated because before the 
actual assignment is made, the information is only symbolic, and it is 
very difficult to estimate the effect of scheduling on offset assignment. 
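
The two assignments in the example above can be compared by counting how many consecutive accesses are more than one location apart, since only those require an explicit instruction to change the AR. The sketch below assumes a single address register and a step size of one for auto-increment and auto-decrement; the function name is an illustrative choice.

```python
def explicit_ar_updates(access_order, address):
    """Count accesses that cannot be reached by auto-increment/decrement."""
    return sum(1 for a, b in zip(access_order, access_order[1:])
               if abs(address[a] - address[b]) > 1)


first = explicit_ar_updates("abacd", {"a": 1, "b": 2, "c": 3, "d": 4})
second = explicit_ar_updates("abacd", {"a": 2, "b": 1, "c": 3, "d": 4})
```

The first assignment needs one explicit update (from a to c), while the second needs none.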

Finally, to fully exploit the features of many DSP microproces- 
sors, zero-overhead loops have to be detected and appropriate code 
generated wherever possible. 
Abstract

Decision support applications are growing in popularity as 
more business data is kept on-line. Such applications typ- 
ically include complex SQL queries that can test a query 
optimizer's ability to produce an efficient access plan. Many 
access plan strategies exploit the physical ordering of data 
provided by indexes or sorting. Sorting is an expensive op- 
eration, however. Therefore, it is imperative that sorting 
is optimized in some way or avoided altogether. Toward
that goal, this paper describes novel optimization techniques 
for pushing down sorts in joins, minimizing the number of 
sorting columns, and detecting when sorting can be avoided 
because of predicates, keys, or indexes. A set of fundamen- 
tal operations is described that provide the foundation for 
implementing such techniques. The operations exploit data 
properties that arise from predicate application, uniqueness, 
and functional dependencies. These operations and tech- 
niques have been implemented in IBM's DB2/CS. 

1 Introduction 

As the cost of disk storage drops, more business data 
is being kept on-line. This has given rise to the no- 
tion of a data warehouse, where non-operational data is 
typically kept for analysis by decision support applica- 
tions. Such applications typically include complex SQL 
queries that can test the capabilities of an optimizer. 
Often, huge amounts of data are processed, so an op- 
timizer's decisions can mean the difference between an 
execution plan that finishes in a few minutes versus one
that takes hours to run. 

Many access plan strategies exploit the physical or- 
dering of data provided by indexes or sorting. Sorting is 
an expensive operation, however. Therefore, it is imper- 
ative that sorting is optimized in some way or avoided 
altogether. This leads to a non-trivial optimization
problem, however, because a single complex query can 
give rise to multiple interesting orders [SAC+79]. Here, 
an interesting order refers to a specification for any or-
dering of the data that may prove useful for processing 
a join, an ORDER BY, GROUP BY, or DISTINCT. 
To be effective, an optimizer must detect when indexes 
provide an interesting order, the optimal place to sort 
if sorting is unavoidable, the minimal number of sorting 
columns, whether two or more interesting orders can be 
combined and satisfied by a single sort, and so on. This 
process will be referred to as order optimization.

At first glance, it might seem like hash-based set op- 
erations [BD83, DKO+84] make order optimization a 
non-issue, since hash-based operations do not require 
their input to be ordered. An index may already provide 
an interesting order for some operation, however, mak- 
ing the hash-based alternative more expensive. This is 
particularly true in warehousing environments, where 
indexes are pervasive. As a result, an optimizer needs 
to be cognizant of interesting orders. It should always 
consider both hash- and order-based operations and pick 
the least costly alternative [Gra93]. 

Although people have been building SQL query op- 
timizers for close to twenty years [JV84, Gra93], there 
has been surprisingly little written about the problem 
of order optimization. This paper describes novel tech- 
niques to address that problem. One of the paper's key 
contributions is an algorithm for reducing an interest- 
ing order to a simple canonical form by using applied 
predicates and functional dependencies. This is essen- 
tial for determining when sorting is actually required. 
Another important contribution is the notion of sort- 
ahead, which allows a sort for something like an OR- 
DER BY to be pushed down in a join tree or view. All 
of these techniques have been implemented in the query 
optimizer of IBM's DB2/CS, which is the client-server 
version of DB2 that runs on OS/2, Microsoft Windows NT,
and various flavors of UNIX. Henceforth, DB2/CS will 
be referred to as simply DB2. Much of the discussion in 
this paper is framed in the context of the DB2 query op- 
timizer. The techniques that are described have general 
applicability, however, and could be used in any query 
optimizer. 

The remainder of this paper is organized as follows: 
In Section 2, related work is described. This is followed 
by a brief overview of the DB2 optimizer in Section 3. 
Next, fundamental operations for order optimization are 
described in Section 4. In Section 5, the architecture of
the DB2 optimizer that has been built around those fun- 
damental operations is described. An example is then 
provided in Section 6 to illustrate how things tie to- 
gether. Advanced issues beyond the scope of this paper 
are mentioned in Section 7. Finally, performance results 
are presented in Section 8, and conclusions are drawn 
in Section 9. 

2 Related Work 

The classic work on the System R optimizer by Selinger 
et al. [SAC+79] was the first research to look at the
problem of order optimization. That paper coined the
term "interesting orders". In System R, interesting orders
were mainly used to prevent subplans that satisfy
some useful order from being pruned by less expensive 
but unordered subplans during bottom-up plan genera- 
tion. 

A recent paper on the Rdb optimizer [Ant93] talked 
about combining interesting orders from ORDER BY, 
GROUP BY, and DISTINCT clauses, if possible, so at 
most one sort could be used. That paper was primarily 
an overview of the Rdb optimizer, however. It did not 
specifically focus on order optimization. 

Other, more loosely related papers include those on 
predicate migration [Hel94] and group-by push-down 
[YL93, CS93]. Predicate migration considers whether 
an expensive predicate should be applied before or af- 
ter a join. Similarly, group-by push-down considers 
whether GROUP BY should be performed before a join. 
In each case, an optimizer determines which is the bet- 
ter alternative using its cost estimates. Both techniques 
are similar to the notion of sort-ahead, as described in 
this paper. 

3 Overview 

The DB2 optimizer is a direct descendant of the
Starburst optimizer described in [Loh88, HFLP89]. Among
other things, the DB2 optimizer uses much more sophis- 
ticated techniques for order optimization. This section 
provides an overview of the DB2 optimizer to establish 
some background and terminology. More details will be 
given later. 

The DB2 optimizer actually has several distinct op- 
timization phases. Here, we are mainly concerned with 
the phase where traditional cost-based optimization oc- 
curs. Prior to this phase, an input query is parsed and 
converted to an intermediate form called the query graph 
model (QGM). 

The QGM is basically a high-level, graphical
representation of the query. Boxes are used to represent
relational operations, while arcs between boxes are used
to represent quantifiers, i.e., table references. Each box
includes the predicates that it applies, an input or out- 
put order specification (if any), a distinct flag, and so 
on. The basic set of boxes includes those for SELECT,
GROUP BY, and UNION. Joins are represented by a 
SELECT box with two or more input quantifiers, while 
ORDER BY is represented by a SELECT box with an 
output order specification. 

After its construction, the original QGM is
transformed into a semantically equivalent but more
"efficient" QGM using heuristics such as predicate
push-down, view merging, and subquery-to-join
transformation [PHH92]. Finally, cost-based optimization
is performed. During this phase, the QGM is traversed and
a query execution plan (QEP) is generated. 

A QEP can be viewed as a dataflow graph of
operators, where each node in the graph corresponds to a
relational operation like a join or a low-level operation 
like a sort. Each operator consumes one or more input 
records (i.e., a table), and produces an output set of 
records (another table). We will refer to these as input 
and output streams. Figure 1 illustrates what the QGM 
and QEP might look like for a simple query. 

QUERY

select a.y, sum(b.y)
from a, b
where a.x = b.x
group by a.y

[Figure: QGM and QEP diagrams for the query above]

Figure 1: Simple QGM and QEP Example

Each stream in a QEP has an associated set of proper- 
ties [GD87, Loh88]. Examples of properties include the 
columns that make up each record in the stream, the set 
of predicates that have been applied to the stream, and 
the order of the stream. Each operator in a QEP
determines the properties of its output stream. The
properties of an operator's output stream are a function of its
input stream(s) and the operation being applied by the 
operator. For example, a sort operator passes on all the 
properties of its input stream unchanged except for the 
order property and cost. Note that a stream's order, if 
any, always originates from an ordered index scan or a 
sort. 

During the planning phase of optimization, the 
DB2 optimizer builds a QEP bottom-up, operator-by- 
operator, computing properties as it goes. At each step, 
different alternatives are tried and more costly subplans 
with comparable properties are pruned [Loh88]. At 
strategic points during planning, the optimizer may
decide to build a QEP that satisfies an interesting order.
A sort may need to be added to a QEP if there is no
existing QEP with an order property satisfying the
interesting order.

Interesting orders are generated in a top-down scan
of QGM prior to the planning phase. This is referred
to as the order scan of QGM. Interesting orders arise 
from joins, ORDER BY, GROUP BY, or DISTINCT, 
and are hung off the QGM. Here, both order properties 
and interesting orders will be denoted as a simple list 
of columns in major to minor order, i.e., (c_1, c_2, ..., c_n).
Without loss of generality, we will always assume that
an ascending order is required for each column c_i.

Interesting orders are pushed down and combined in 
the order scan whenever possible. This allows one sort 
to satisfy multiple interesting orders. As interesting
orders are pushed down they can turn into sort-ahead
orders. These allow the optimizer to try pushing down a
sort for, say, an ORDER BY to an arbitrary level in a 
join tree. Different alternatives are tried, and only the 
least costly one is kept. The next section looks at the 
fundamental operations on interesting orders needed to 
accomplish these tasks. 

4 Fundamental Operations for 
Order Optimization 

4.1 Reduce Order 

The most fundamental operation used by order opti- 
mization is something referred to as reduction. Reduc- 
tion is the process of rewriting an order specification 
(i.e., an order property or interesting order) in a simple 
canonical form. This involves substituting each column 
in the specification with a designated representative of 
its equivalence class (called the equivalence class head) 
and then removing all redundant columns. Reduction is 
essential for testing whether an order property satisfies 
an interesting order. 

As a motivating example, consider an arbitrary
interesting order I = (x, y), and suppose an input stream
has the order property OP = (y). A naive test would
conclude that I is not satisfied by OP, and a sort would
be added to the QEP. Suppose, however, that a predicate
of the form col = constant has been applied to the
input stream, e.g., x = 10. Then the column x in I is
redundant since it has the value 10 for all records. Hence,
I can be rewritten as I = (y). After being rewritten,
it is easy to determine that OP satisfies I, so no sort
is necessary. Note that a literal expression, host
variable, or correlated column qualifies as a constant in
this context.

Reduction also needs to take column equivalence 
classes into account. These are generated by predicates 
of the form col = col. For example, suppose I = (x, z)
and OP = (y, z). Further suppose that the predicate
x = y has been applied. The equivalence class generated
by x = y allows OP to be rewritten as OP = (x, z).
After being rewritten, it is easy to determine that OP
satisfies I.

Reduction also needs to take keys into account. For 
example, suppose I = (x, y) and OP = (x, z). If x
is a key, then these can be rewritten as I = (x) and
OP = (x). Here, y and z are redundant since x alone is 
sufficient to determine the order of any two records. 

Keys are really just a special case of functional de- 
pendencies (FDs) [DD92]. So rather than keys, FDs are 
actually used by reduction, since they are more
powerful. In the DB2 optimizer, a set of FDs is included in
the properties of a stream. The way FDs are maintained 
as a property will be discussed in more detail later. 

The notation used for FDs is as follows: A set of 
columns A = {a_1, a_2, ..., a_n} functionally determines
columns B = {b_1, b_2, ..., b_m} if for any two records with
the same values for columns in A, the values for columns
in B are also the same. This is denoted as A -> B. The
head of the FD is A, while the tail is B.

It is important to note that all of the above
optimizations can be framed in terms of functional
dependencies. This is because a predicate of the form x = 10
gives rise to {} -> {x}, i.e., the "empty-headed" FD
[DD92]. Moreover, a predicate of the form x = y gives
rise to {x} -> {y} and {y} -> {x}. If x = y is a join
predicate for an outer join, then {x} -> {y} holds if x is
a column from a non-null-supplying side. In addition,
{x} -> {all cols} when x is a key. Finally, {x} -> {x}
is always true.

The mapping of predicate relationships and keys to 
functional dependencies makes it possible to express re- 
duction in a very simple and elegant way. The algorithm 
for Reduce Order is shown in Figure 2. In the algorithm, 
note that the equivalence class head is chosen from those 
columns made equivalent by predicates already applied 
to the stream. Also note that B -> {c_i} if there exists
some B' -> C where B' ⊆ B and C ⊇ {c_i}. This follows
from the algebra on FDs [DD92]. Consequently, simple
subset operations can be used on the input FDs to test
whether B -> {c_i}.

Reduce Order
input:
   a set of FDs, applied predicates, and
   order specification O = (c_1, c_2, ..., c_n)
output:
   the reduced version of O

1) rewrite O in terms of each column's
   equivalence class head
2) scan O backwards
3) for (each column c_i scanned)
4)    let B = {c_1, c_2, ..., c_{i-1}}, i.e.,
      the columns of O preceding c_i
5)    if ( B -> {c_i} ) then
6)       remove c_i from O
7)    endif
8) endfor

Figure 2: Reduce Order Algorithm 
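
The Reduce Order steps can be sketched in Python as follows. This is only an illustrative sketch, not the DB2 implementation: the names `reduce_order`, `determines`, and `eq_head` are hypothetical, FDs are modeled as (head, tail) pairs of column sets, a col = constant predicate is assumed to be passed in as an empty-headed FD, and col = col predicates are assumed to be summarized by a mapping from each column to its equivalence class head.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

# An FD is modeled as a (head, tail) pair of column sets: head -> tail.
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def determines(fds: Sequence[FD], cols: FrozenSet[str], target: str) -> bool:
    """Subset test for cols -> {target}; {x} -> {x} is treated as always true."""
    if target in cols:
        return True
    return any(head <= cols and target in tail for head, tail in fds)


def reduce_order(order: Sequence[str],
                 fds: Sequence[FD],
                 eq_head: Dict[str, str]) -> Tuple[str, ...]:
    # Step 1: rewrite each column as its equivalence class head.
    cols = [eq_head.get(c, c) for c in order]
    # Steps 2-8: scan backward, dropping any column that is
    # functionally determined by the columns preceding it.
    for i in range(len(cols) - 1, -1, -1):
        if determines(fds, frozenset(cols[:i]), cols[i]):
            del cols[i]
    return tuple(cols)


# x = 10 applied, modeled as the empty-headed FD {} -> {x}.
const_x = [(frozenset(), frozenset({"x"}))]
# x is a key, modeled as {x} -> {all cols}.
key_x = [(frozenset({"x"}), frozenset({"x", "y", "z"}))]

print(reduce_order(("x", "y"), const_x, {}))      # ('y',)
print(reduce_order(("y", "z"), [], {"y": "x"}))   # ('x', 'z')
print(reduce_order(("x", "y"), key_x, {}))        # ('x',)
print(reduce_order(("x",), const_x, {}))          # ()
```

The four calls mirror the examples in the text: a column bound to a constant, a column equivalence, a key, and an order that reduces to the empty order.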

The correctness proof for Reduce Order is
straightforward. Consider what happens when two records r_1 and
r_2 are compared. The only time the value of c_i affects
their order is when r_1 and r_2 have the same values for all
columns in B. But then r_1.c_i and r_2.c_i must also have
the same value because B -> {c_i}. Consequently,
removing c_i will not change the order of records produced
by O.

Before moving on, note that an order specification 
can become "empty" after being reduced. For example, 
suppose the predicate x = 10 has been applied and the 
interesting order I = (x) is reduced. The predicate
x = 10 gives rise to {} -> {x}. Consequently, I will
reduce to the empty interesting order I = (), which is
trivially satisfied by any input stream. 

4.2 Test Order 

As it generates a QEP, the optimizer has to test whether 
a stream's order property OP satisfies an interesting
order I. If not, a sort is added to the QEP. The algorithm
for Test Order is shown in Figure 3. Note that when a 
sort is required, the reduced version of I provides the 
minimal number of sorting columns, which is important 
for minimizing sort costs. 
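
As a rough illustration of the Test Order check (Figure 3), the sketch below assumes that both I and OP have already been reduced and are passed in as tuples of column names. The function name `satisfies_order` is hypothetical and is not taken from DB2.

```python
from typing import Sequence


def satisfies_order(reduced_i: Sequence[str], reduced_op: Sequence[str]) -> bool:
    """True if I is empty or the columns of I are a prefix of the columns of OP.

    Both arguments are assumed to be in reduced form already.
    """
    i, op = tuple(reduced_i), tuple(reduced_op)
    # An empty I gives op[:0] == (), so the empty order is always satisfied.
    return op[:len(i)] == i


print(satisfies_order(("y",), ("y",)))        # True
print(satisfies_order((), ("a", "b")))        # True
print(satisfies_order(("x", "y"), ("y",)))    # False
```

The last call corresponds to the naive, unreduced comparison of I = (x, y) against OP = (y); the first call is the same test after I has been reduced to (y).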

4.3 Cover Order 

As mentioned earlier, the DB2 optimizer tries to com- 
bine interesting orders in the top-down order scan of 
QGM. This often allows one sort to satisfy multiple
interesting orders. When two interesting orders are
combined, a cover is generated.

Test Order
input:
   an interesting order I and an order
   property OP
output:
   true if OP satisfies I, otherwise false

1) reduce I and OP
2) if ( I is empty or the columns in I
        are a prefix of the columns in OP ) then
3)    return true
4) else
5)    return false
6) endif

Figure 3: Test Order Algorithm

The cover of two interesting
orders I_1 and I_2 is a new interesting order C such that
any order property which satisfies C also satisfies both
I_1 and I_2. For example, the cover of I_1 = (x) and
I_2 = (x, y) is C = (x, y).

Of course, it is not always possible to generate a 
cover. For example, there is no cover for I_1 = (y, x)
and I_2 = (x, y, z). As in Test Order, however,
interesting orders need to be reduced before attempting a cover.
Suppose the predicate x = 10 has been applied in this
example. Then the interesting orders would reduce to
I_1 = (y) and I_2 = (y, z), giving the cover C = (y, z).

The algorithm for Cover Order is shown in Figure 4. 

Cover Order
input:
   interesting orders I_1 and I_2
output:
   the cover of I_1 and I_2, or a return code
   indicating that a cover is not possible

1) reduce I_1 and I_2
2) w.l.o.g., assume I_1 is the shorter interesting order
3) if ( I_1 is a prefix of I_2 ) then
4)    return I_2
5) else
6)    return "cannot cover I_1 and I_2"
7) endif

Figure 4: Cover Order Algorithm 
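
A minimal Python sketch of Cover Order follows. It assumes the two interesting orders have already been reduced, and it uses `None` as the "cannot cover" return code; the name `cover_order` is illustrative only.

```python
from typing import Optional, Sequence, Tuple


def cover_order(i1: Sequence[str], i2: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Return the cover of two reduced interesting orders, or None if none exists."""
    a, b = tuple(i1), tuple(i2)
    # Step 2: without loss of generality, treat the shorter order as I_1.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    # Steps 3-6: the longer order is the cover iff the shorter one is its prefix.
    if longer[:len(shorter)] == shorter:
        return longer
    return None


print(cover_order(("x",), ("x", "y")))            # ('x', 'y')
print(cover_order(("y", "x"), ("x", "y", "z")))   # None
print(cover_order(("y",), ("y", "z")))            # ('y', 'z')
```

The second and third calls correspond to the example in the text before and after the predicate x = 10 is taken into account by reduction.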



4.4 Homogenize Order 

As mentioned earlier, an attempt is made to push down 
interesting orders in the order scan of QGM so that
sort-ahead may be attempted. When an interesting order I is
pushed down, some columns may have to be substituted
with equivalent columns in the new context. This is
referred to as homogenization. For example, consider
the following query: 

select * 
from a, b 
where a.x = b.x
order by a.x, b.y 

Here, the ORDER BY gives rise to the interesting
order I = (a.x, b.y). The order scan will try to push
down I to the access of both table a and table b as a
sort-ahead order. For the access of table b, the equivalence
class generated by a.x = b.x is used to homogenize I as
I_b = (b.x, b.y).

I cannot be pushed down to the access of table a,
since b.y is unavailable until after the join. However,
suppose a.x is a base-table key that remains a key after
the join [DD92]. If so, {a.x} -> {b.y}. This allows I
to be reduced to I = (a.x), which can be pushed down
to the access of table a. As this example illustrates,
an interesting order needs to be reduced before being 
homogenized. The algorithm for Homogenize Order is 
shown in Figure 5. 

Homogenize Order
input:
   an interesting order I and target
   columns C = {c_1, c_2, ..., c_n}
output:
   I homogenized to C, that is, I_C, or a return
   code indicating that I_C is not possible

1) reduce I
2) using equivalence classes, try to substitute each
   column in I with a column in C
3) if ( all the columns in I could be substituted ) then
4)    return I_C
5) else
6)    return "cannot homogenize I to C"
7) endif

Figure 5: Homogenize Order Algorithm 

Note that unlike Reduce Order, Homogenize Order 
can choose any column in the equivalence class for sub- 
stitution. Moreover, there is no need to choose from 
just the columns that have been made equivalent by 
predicates applied so far. Columns that will become 
equivalent later because of predicates that have yet to 
be applied can also be considered. This is because ho- 
mogenization is concerned with producing an order that 
will eventually satisfy I. 
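
The substitution step of Homogenize Order might look like the following sketch. It assumes the interesting order has already been reduced and that equivalence classes (including those from predicates not yet applied) are supplied as a list of column sets. The name `homogenize_order` is hypothetical, and picking the alphabetically first equivalent column is an arbitrary choice made here only to keep the result deterministic.

```python
from typing import FrozenSet, Iterable, Optional, Sequence, Tuple


def homogenize_order(reduced_order: Sequence[str],
                     target_cols: Iterable[str],
                     eq_classes: Sequence[FrozenSet[str]]) -> Optional[Tuple[str, ...]]:
    """Rewrite a reduced order in terms of target_cols, or return None if impossible."""
    targets = frozenset(target_cols)
    result = []
    for col in reduced_order:
        if col in targets:
            result.append(col)
            continue
        # Find the equivalence class of col, if it belongs to one.
        cls = next((c for c in eq_classes if col in c), frozenset())
        candidates = sorted(cls & targets)
        if not candidates:
            return None  # "cannot homogenize I to C"
        result.append(candidates[0])
    return tuple(result)


eq = [frozenset({"a.x", "b.x"})]  # from the predicate a.x = b.x
print(homogenize_order(("a.x", "b.y"), {"b.x", "b.y"}, eq))  # ('b.x', 'b.y')
print(homogenize_order(("a.x", "b.y"), {"a.x", "a.y"}, eq))  # None
print(homogenize_order(("a.x",), {"a.x", "a.y"}, eq))        # ('a.x',)
```

The three calls follow the query example above: homogenizing I to table b, failing to homogenize the unreduced I to table a, and succeeding once I has been reduced to (a.x).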



5 The Architecture for Order 
Optimization in DB2 

This section describes the overall architecture of the 
DB2 optimizer for order optimization. Only a high-level
summary of the architecture is provided. The focus will 
be those parts of the architecture that have been built 
around the fundamental operations discussed in the pre- 
vious section. 

5.1 The Order Scan of QGM 

As mentioned earlier, interesting orders are generated 
during the order scan, which takes place prior to the 
planning phase of optimization. Interesting orders arise 
from joins, ORDER BY, GROUP BY, or DISTINCT, 
and are hung off the QGM. 

Each QGM box has an associated output order re- 
quirement, and each QGM quantifier has an associated 
input order requirement. In contrast to an interesting 
order, an order requirement forces a stream to have a 
specific order. Either the input or output order require- 
ment can be empty. Output order requirements come 
from ORDER BY, while input order requirements cur- 
rently come from GROUP BY. (Note that this does not 
preclude hash-based GROUP BY from being consid- 
ered during the planning phase of optimization.) Each 
QGM box also has an associated list of interesting or- 
ders, which can double as sort-ahead orders. 

Conceptually, the order scan has four stages. In the 
first stage, input and output order requirements are de- 
termined for each QGM box. Then, interesting orders 
for each DISTINCT are determined. Next, interesting
orders for merge-joins and subqueries are determined.
Finally, the QGM graph is traversed in a top-down man- 
ner. 

In the top-down traversal, interesting orders are re- 
cursively pushed down along quantifier arcs. When an 
interesting order is pushed down to a quantifier Q, it
gets homogenized to Q's columns and then covered with
Q's input order requirement, if any. Similarly, before an
interesting order can be pushed into a box B and added
to B's list of interesting orders, it gets covered with B's
output order requirement.

One subtlety in the order scan is that the algorithms 
for Cover Order and Homogenize Order require their 
inputs to be reduced. This in turn requires a set of 
applied predicates and FDs. Unfortunately, these are 
not known in the order scan since they are computed as 
properties during the planning phase of optimization. 

This problem is resolved by proceeding optimistically. 
When an interesting order I is pushed down, the or- 
der scan simply assumes that all the predicates below a 
given box have been applied. Furthermore, if I cannot
be fully homogenized to a quantifier, the largest prefix
of I that can be homogenized is used. This is done in
the hope that some FD will make the suffix redundant. 
The planning phase can detect when these assumptions 
turn out to be false. 

5.2 The Planning Phase of 
Optimization 

During the planning phase of optimization, the DB2 
optimizer walks the QGM bottom-up, box-by-box, and 
incrementally builds a QEP. For each box, alternative 
subplans are generated, and more costly subplans with 
comparable properties are pruned [Loh88]. The input 
and output interesting orders associated with each box 
are used to detect when a sort is required. 

As a QEP is built, the interesting orders that hang off 
a QGM box are used for both pruning and to generate 
sort-ahead orders. During join enumeration, for exam- 
ple, the optimizer will try sorting the outer for each 
interesting order it finds. This allows a sort for, say, an
ORDER BY to be pushed down an arbitrary number 
of levels in a join tree or view. If no sort is actually 
required at any level, this will be detected, of course. 
Note that this is only done for join methods where the 
order of the outer stream is propagated by the join. 

When an interesting order is pushed down to the 
outer of a join, it has to be homogenized to the
quantifier(s) that belong to the outer. This cannot be done
during the order scan, since the order in which joins are 
enumerated is not known then. In the case of a merge- 
join, a cover with the merge-join order is also required. 

Unfortunately, the process of pushing down sort- 
ahead orders increases the complexity of join enumera- 
tion [OL90]. This is because two join subtrees with the 
same tables but different orders are not compared and 
pruned against each other. It is possible to show that 
the complexity of join enumeration increases by a factor 
of O(n^2) for n sort-ahead orders. In practice, this has
not been a problem, since typically n < 3.

5.2.1 Properties 

For order optimization, the most important properties 
are the order property, the predicate property, the key 
property, and the FD property. Each of these is dis- 
cussed in detail below. For any property x, the two 
primary issues are how x propagates through operators
and how two plans are compared on the basis of x. 

How the different properties propagate will be dis- 
cussed shortly. In terms of the way properties are com- 
pared, the DB2 optimizer treats everything uniformly. 
Let P_1 and P_2 be two plans being compared. Also,
overload the symbol "<" for properties to mean less general
or equivalent. Then P_2 prunes P_1 if P_2.cost < P_1.cost
and for every property x, P_1.x < P_2.x. In other words,
P_1 can be pruned if it costs more than P_2 and has less
general properties. Thus, for pruning, it suffices to
define < for each property.
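
The pruning rule can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Plan` class, the `prunes` function, and the two comparators (prefix test for already-reduced orders, subset test for predicates) are hypothetical stand-ins for the per-property definitions of "<" given below.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Plan:
    cost: float
    props: Dict[str, Any]


# For each property, a function deciding whether a is less general than or
# equivalent to b.
Comparators = Dict[str, Callable[[Any, Any], bool]]


def prunes(p2: Plan, p1: Plan, comparators: Comparators) -> bool:
    """True if p2 costs less than p1 and p1.x < p2.x holds for every property x."""
    if not p2.cost < p1.cost:
        return False
    return all(leq(p1.props[name], p2.props[name])
               for name, leq in comparators.items())


comparators: Comparators = {
    "order": lambda a, b: b[:len(a)] == a,   # a is a prefix of b (reduced orders)
    "predicates": lambda a, b: a <= b,       # a is a subset of b
}

preds = frozenset({"x = 10"})
costly = Plan(25.0, {"order": ("x",), "predicates": preds})
cheap = Plan(10.0, {"order": ("x", "y"), "predicates": preds})
unordered = Plan(5.0, {"order": (), "predicates": preds})

print(prunes(cheap, costly, comparators))      # True
print(prunes(unordered, costly, comparators))  # False
```

The second call shows why interesting orders matter during pruning: the cheapest plan is unordered, so it cannot prune a more expensive plan that carries a useful order.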

The Order Property 

The order property (if any) of a stream always originates 
from an ordered index scan or a sort. The way it prop- 
agates for most relational operators is straightforward 
except for projections and joins. If any column c_i of an
order property OP = (c_1, c_2, ..., c_n) is projected out, then
only the prefix OP' = (c_1, c_2, ..., c_{i-1}) is propagated.

For both nested-loops and merge-join [BE76], the or- 
der of the outer stream is propagated. In the special 
case when the outer has only one record, however, the 
inner order is propagated. There are also circumstances 
where the outer and inner orders can be concatenated, 
but that discussion is beyond the scope of this paper. 
For hash-join [DKO+84], neither the outer nor inner
order is propagated.

The Test Order algorithm given in Section 4.2 is used 
to compare the order properties OP_1 and OP_2 of two
plans during pruning. Let int(OP_1) denote OP_1 cast as
an interesting order. Then "<" can be defined for the
order property as follows: OP_1 < OP_2 if OP_2 satisfies
int(OP_1).

The Predicate Property 

The predicate property is simply the set of conjuncts 
which have been applied to a stream. Each operator 
propagates the predicate property by taking the predi- 
cate property of its input stream and unioning it with 
any conjuncts applied by the operator. For the
predicate property, "<" is defined as follows: Let PP_1 and
PP_2 be the predicate properties of two plans being
compared. Then, PP_1 < PP_2 if PP_1 ⊆ PP_2.

The predicate property is used to determine both 
column equivalences and functional dependencies that 
arise from the application of equality predicates. In the 
DB2 optimizer, FDs that arise from predicates are not
actually added to the FD property, however, since this 
information would be redundant and would only add 
to the complexity of maintaining the FD property (see 
below). 

The Key Property 

The key property of a stream is the set of unique keys 
for the stream. Each key K is represented as a set 
of columns K = {c_1, c_2, ..., c_n}. Keys are useful for a
variety of reasons beyond their role in order
optimization. One example is their use in DISTINCT
elimination [PL94]. Consequently, in the DB2 optimizer, keys
are maintained as a separate property. 

Keys originate from base-table constraints or can be 
added via a GROUP BY or DISTINCT operation. If 
any column c_i of a key K = {c_1, c_2, ..., c_n} in a key
property KP is projected out by an operator, then K is
removed from KP. 

Whether a key propagates in a join requires anal- 
ysis of the join predicates and the keys of the join's 
input streams. Consider the join of two streams S_1
and S_2 on join predicates JP. Let the key properties
of S_1 and S_2 be denoted as KP_1 and KP_2,
respectively. If a given row of S_1 can match at most one
row of S_2 (i.e., the join is n-to-1), then KP_1 is
propagated. This is true if any key K = {c_1, c_2, ..., c_n} of
KP_2 is fully qualified by predicates in JP of the form
S_1.col = S_2.c_i for all c_i. Similarly, if the join is
1-to-n, then KP_2 is also propagated. If neither KP_1 nor
KP_2 can be propagated, then the key property of the
join is formed by generating all concatenated key pairs
K_1 K_2, where K_1 ∈ KP_1 and K_2 ∈ KP_2. For example,
if K_1 = {a_1, a_2, ..., a_n} and K_2 = {b_1, b_2, ..., b_m} then
K_1 K_2 = {a_1, a_2, ..., a_n, b_1, b_2, ..., b_m}.
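
The key propagation rules for a join can be sketched as follows. This is a simplified illustration under stated assumptions: join predicates are restricted to equalities given as (S_1 column, S_2 column) pairs, keys are frozensets of column names, and the function name `join_key_property` is hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Key = FrozenSet[str]


def join_key_property(kp1: Set[Key], kp2: Set[Key],
                      join_preds: Iterable[Tuple[str, str]]) -> Set[Key]:
    """Key property of S_1 join S_2 for equality predicates S_1.col = S_2.col."""
    preds = list(join_preds)
    s1_cols = frozenset(a for a, _ in preds)
    s2_cols = frozenset(b for _, b in preds)
    # n-to-1: some key of S_2 is fully qualified, so each S_1 row
    # matches at most one S_2 row and KP_1 propagates.
    n_to_1 = any(k <= s2_cols for k in kp2)
    # 1-to-n: the symmetric case, so KP_2 propagates.
    one_to_n = any(k <= s1_cols for k in kp1)
    result: Set[Key] = set()
    if n_to_1:
        result |= kp1
    if one_to_n:
        result |= kp2
    if not (n_to_1 or one_to_n):
        # Neither side propagates: form all concatenated key pairs K_1 K_2.
        result = {k1 | k2 for k1 in kp1 for k2 in kp2}
    return result


preds = [("a.x", "b.x")]
print(join_key_property({frozenset({"a.id"})}, {frozenset({"b.x"})}, preds))
# {frozenset({'a.id'})}
print(join_key_property({frozenset({"a.id"})}, {frozenset({"b.id"})}, preds))
# {frozenset({'a.id', 'b.id'})} (element order within a set may vary)
```

In the first call b.x is a key of the inner stream, so the join is n-to-1 and the outer key survives; in the second call neither key is qualified by the join predicate, so only the concatenated key is produced.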

An attempt is made to keep each key property as 
"succinct" as possible by removing keys that have be- 
come redundant because of projections and/or applied 
predicates. Each key is rewritten in a canonical form 
by substituting each column with its equivalence class 
head and removing redundant columns. If the DB2
optimizer detects that some key has become fully qualified
by equality predicates during this process, then the en- 
tire key property is discarded and a one-record condition 
is flagged. This condition serves as the key property and 
indicates that at most one record is in the stream. 

After simplifying each key in the property, redundant 
keys are removed from the key property using the
definition of "<" that follows: Let key K_1 = {a_1, a_2, ..., a_n}
and let key K_2 = {b_1, b_2, ..., b_m}. Then K_1 < K_2 if
{b_1, b_2, ..., b_m} ⊆ {a_1, a_2, ..., a_n}. If K_1 < K_2, then K_1
is implied by K_2. In that case, K_1 is redundant and can
be removed.

The definition of "<" is also used to compare the key
properties KP_1 and KP_2 of two plans during pruning.
More specifically, KP_1 < KP_2 if for all K_1 ∈ KP_1 there
exists some K_2 ∈ KP_2, where the relationship K_1 < K_2
holds. In other words, each K_1 ∈ KP_1 is implied by
some K_2 ∈ KP_2.

The FD Property 

In the DB2 optimizer, the FD property is simply a set 
of FDs, which can be empty. Each FD originates from
a key. A key becomes an FD when it fails to propagate
through a join. The columns of the key become
the new FD's head, and the remaining columns in the
key's stream become the new FD's tail. As an example,
assume K = {c_1} is a key in the join stream S
with columns {c_1, c_2, ..., c_n}. Further assume that the
key property KP of S does not propagate in the join.
Then, {c_1} -> {c_2, ..., c_n} is added to the FD property of
the join. The same is done for all keys in KP. Note that
if S had a one-record condition, then the empty-headed
FD {} -> {c_1, c_2, ..., c_n} would be generated.

The effect of projection on the FD property is similar 
to the effect of projection on the key property. Let A =
{a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_m}. Then let F be
a member of the FD property FP, where F is defined
as A -> B. If any column a_i in A is projected out, then F
is removed from FP. In contrast, if any column b_i in B
is projected out, then F' replaces F, where F' is identical
to F but with b_i removed from B.

Except for projection, FDs almost always propagate 
unchanged. In a join, the FDs of the outer and inner 
stream are combined and keys that do not propagate 
are used to generate new FDs, as described above. The 
resulting set of FDs can then be used to infer still more 
FDs [DD92]. This is not done in the DB2 optimizer 
because of its complexity, which is NP-complete in the 
general case [BB79].

Like the key property, an attempt is made to keep
each FD property as "succinct" as possible by removing
FDs that have become redundant because of projections
and/or applied predicates. First, each FD is rewritten
in a canonical form by substituting each column with its
equivalence class head and removing redundant columns
from both the head and tail. Then redundant FDs are
removed from the FD property using the definition of
"<" that follows: Let F_1 be defined as A_1 -> B_1 and let
F_2 be defined as A_2 -> B_2. Then, F_1 < F_2 if A_2 ⊆ A_1
and B_2 ⊇ B_1. If F_1 < F_2, then F_1 is implied by F_2.
In that case, F_1 is redundant and can be removed from
the FD property.
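
The definition of "<" for FDs translates directly into subset tests. The sketch below is illustrative only; the names `fd_implied_by`, `remove_redundant_fds`, and `fd_property_leq` are hypothetical, and each FD is modeled as a (head, tail) pair of column sets that is assumed to be in canonical form already.

```python
from typing import FrozenSet, Iterable, Set, Tuple

FD = Tuple[FrozenSet[str], FrozenSet[str]]  # (head, tail), i.e., head -> tail


def fd_implied_by(f1: FD, f2: FD) -> bool:
    """F_1 < F_2: the head of F_2 is within F_1's head and its tail covers F_1's tail."""
    (a1, b1), (a2, b2) = f1, f2
    return a2 <= a1 and b1 <= b2


def remove_redundant_fds(fds: Iterable[FD]) -> Set[FD]:
    """Drop every FD that is implied by a different FD in the same property."""
    unique = set(fds)
    return {f for f in unique
            if not any(g != f and fd_implied_by(f, g) for g in unique)}


def fd_property_leq(fp1: Iterable[FD], fp2: Iterable[FD]) -> bool:
    """FP_1 < FP_2: every FD in FP_1 is implied by some FD in FP_2."""
    fp2 = list(fp2)
    return all(any(fd_implied_by(f1, f2) for f2 in fp2) for f1 in fp1)


x, xz = frozenset({"x"}), frozenset({"x", "z"})
y, yw = frozenset({"y"}), frozenset({"y", "w"})
fds = {(x, y), (xz, y), (x, yw)}
print(remove_redundant_fds(fds) == {(x, yw)})    # True
print(fd_property_leq({(xz, y)}, {(x, yw)}))     # True
print(fd_property_leq({(x, yw)}, {(xz, y)}))     # False
```

Here {x} -> {y, w} implies both {x} -> {y} and {x, z} -> {y}, so it is the only FD that survives.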

The definition of "<" is also used to compare the FD
properties FP_1 and FP_2 of two plans during pruning.
More specifically, FP_1 < FP_2 if for all F_1 ∈ FP_1 there
exists some F_2 ∈ FP_2, where the relationship F_1 < F_2
holds. In other words, each F_1 ∈ FP_1 is implied by
some F_2 ∈ FP_2.

6 An Example 

An example that illustrates how some of the techniques 
tie together is shown in Figure 6. In the example, the 
ORDER BY's interesting order OB = (a.x) was pushed
down and covered with the GROUP BY's interesting
order GB = (a.x, a.y, b.y). The resulting cover was then
pushed down and itself covered with the merge-join's
interesting order MJ = (b.x). The key on b.x gives
rise to {b.x} -> {b.y}, which propagates through all
the joins. This FD and the equivalence class generated
by the predicate a.x = b.x allowed the optimizer to
detect that GB can be reduced to GB = (a.x, a.y). As
a result, the sort on a.x, a.y simultaneously satisfies all
interesting orders.

QUERY

select a.x, a.y, b.y, sum(c.z)
from a, b, c
where a.x = b.x
and b.x = c.x
group by a.x, a.y, b.y
order by a.x

[QEP diagram: the sort produces order (a.x, a.y), which
satisfies the merge-join, group by, and order by]

Figure 6: Query Example

As shown, the optimizer determined that pushing
down the sort before the first join results in the most
efficient QEP. This is likely to be true if the size of table
a is smaller than the result of either join. Because of the
indexes on b.x and c.x, the resulting QEP would
probably beat one that used hash-based operators. Finally,
note that the sort could be eliminated if there was an
ordered index on a.x, a.y.

7 Advanced Issues

One of the issues that we have tacitly avoided in this
paper is the fact that the order-based GROUP BY and
DISTINCT operators do not dictate an exact
interesting order. For example, consider a GROUP BY for
x, y, sum(distinct z). This can be satisfied by (x, y, z)
or (y, x, z). Moreover, x, y, and z can be in ascending
or descending order. In fact, a total of sixteen different
orders can satisfy the order-based GROUP BY.

Rather than generate sixteen different interesting
orders, one general interesting order is used in the real
implementation. It includes information about which
columns can be permuted and which columns can be in
ascending or descending order. Using this information,
the DB2 optimizer can correctly detect any order that
satisfies the order-based GROUP BY. Accounting for
these "degrees of freedom" adds a non-trivial amount
of complexity to all operations on orders. It probably
doubled the amount of code. In general, though, the
same underlying logic that has been described in this
paper still prevails.

8 Performance Results 

Clearly, the techniques described in this paper for or- 
der optimization can only improve the quality of exe- 
cution plans produced by an optimizer. In cases where 
an execution plan's performance would degrade, which 
can happen with sort-ahead, an optimizer would sim- 
ply pick a better alternative using its cost estimates. 
Therefore, the only question is whether the improve- 
ment in performance offered by our techniques is worth 
the implementation effort. More specifically, are there 
a lot of "real world" queries where the improvement in 
performance is significant? 

IBM maintains a number of internal benchmarks that 
have been inspired by real DB2 customers over the 
years. On those benchmarks and at customer sites, we 
have observed substantial improvement in the perfor- 
mance of many queries because of the techniques de- 
scribed in this paper. The biggest improvements are 
typically seen in decision-support environments with 
lots of indexes. Often, applications in these environ- 
ments cannot fully anticipate the predicates that will 
be specified by end-users at runtime. Nor can they an- 
ticipate schema changes, such as the addition of a new 
index or key. As a result, queries in these environments 
frequently include a lot of redundancy - grouping on 
key columns, sorting on columns that are bound to con- 
stants through predicates, and so on. Order optimiza- 
tion is able to eliminate this kind of redundancy, which 
in turn usually leads to a better execution plan.

8.1 TPC-D Results 

Unfortunately, the benchmarks described above are un- 
known outside of IBM. Therefore, we turn to the TPC- 
D benchmark 1 to illustrate how much our techniques 
for order optimization can improve performance. A de- 
scription of the TPC-D benchmark and its schema is 
omitted. For details, readers are directed to [Eng95]. 

TPC bylaws prohibit us from disclosing a full set of 
unaudited TPC-D results. Moreover, IBM was reluc- 
tant to let certain results be published when this paper 
was written. Consequently, the focus here will be on just 
Query 3 of the TPC-D benchmark. Query 3 was chosen 
because it is (relatively) simple and benefits from sev- 
eral of the techniques that have been described in this 
paper. Query 3 retrieves the shipping priority and po- 
tential revenue of the orders having the largest revenue 
among those that had not been shipped as of a given 
date. It is defined as follows: 



select l_orderkey, 
       sum(l_extendedprice * (1 - l_discount)) as rev, 
       o_orderdate, o_shippriority 
from customer, order, lineitem 
where o_orderkey = l_orderkey 
  and c_custkey = o_custkey 
  and c_mktsegment = 'building' 
  and o_orderdate < date('1995-03-15') 
  and l_shipdate > date('1995-03-15') 
group by l_orderkey, o_orderdate, o_shippriority 
order by rev desc, o_orderdate 

1 TPC-D is a trademark of the Transaction Processing Council. 

To gather performance results, we built a modified 
version of DB2 with order optimization disabled. Then 
we ran queries on both the production and the disabled 
versions of DB2. Results were obtained on a 1GB TPC-D 
database using a single IBM RS/6000 Model 59H (66 
MHz) server with 512MB of memory and running AIX 
4.1. A real benchmark configuration was used, with 
data striped over 15 disks and 4 I/O controllers. Using 
a combination of big-block I/O, prefetching, and 
I/O parallelism, this configuration was able to drive the 
CPU at 100% utilization. 

The results for Query 3 are shown in Table 1. The 
numbers in the table correspond to the elapsed time to 
run Query 3, averaged over five runs. As shown, the 
version of DB2 with order optimization disabled was 
significantly slower than the production version of 
DB2 (by a ratio of 2.04). 



Production DB2    Disabled DB2    Ratio 
192 sec.          393 sec.        2.04 

Table 1: Elapsed Time for Query 3 

The execution plan chosen by the production version 
of DB2 is shown in Figure 7. Using a combination of 
Reduce Order, Cover Order, and Homogenize Order, 
the DB2 optimizer was able to determine that it was 
beneficial to push the sort for the GROUP BY below 
the nested-loop join. This sort not only provided the re- 
quired order for the GROUP BY, but it also caused the 
index probes in the nested-loop join to become clus- 
tered. We refer to these as ordered nested-loop joins. 
Here, an ordered nested-loop join is especially impor- 
tant because it allows prefetching and parallel I/O to 
be used on the lineitem table, which is the largest of all 
the TPC-D tables. 
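The effect of ordering the probes can be illustrated with a toy model. The sketch below is not taken from DB2: it assumes the inner table is stored clustered on the join key with a fixed number of rows per page, and the function name `page_switches`, the page size, and the randomly generated probe keys are all hypothetical. It counts how often consecutive probes land on a different data page.

```python
import random


def page_switches(probe_keys, rows_per_page=100):
    """Count how often consecutive probes touch a different data page,
    assuming the inner table is clustered on the join key (an assumption)."""
    switches = 0
    last_page = None
    for key in probe_keys:
        page = key // rows_per_page
        if page != last_page:
            switches += 1
            last_page = page
    return switches


# Hypothetical workload: 10,000 probes into a table of 100,000 keys.
rng = random.Random(0)
outer_keys = [rng.randrange(100_000) for _ in range(10_000)]

unordered = page_switches(outer_keys)        # probes in arrival order
ordered = page_switches(sorted(outer_keys))  # probes after the sort
```

In this model the sorted probes visit each page at most once and in ascending page order, which is the access pattern that lets prefetching and parallel I/O be applied; the unsorted probes jump between pages almost every time.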

Figure 7: Query 3 in Production Version of DB2 

In Figure 7, note that the sort on o_orderkey 
satisfied the GROUP BY because of the equivalence 
class generated by the predicate o_orderkey = 
l_orderkey and because of the FD {o_orderkey} -> 
{o_orderdate, o_shippriority}. In SQL queries, there is 
often no choice but to include functionally dependent 
(i.e., redundant) columns like these in a GROUP BY, 
since that is the only way to have them appear as 
output. 
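The reasoning for Query 3 can be checked with a short sketch. This is a simplified test written for illustration, not the algorithm used inside DB2: the helper names `closure` and `sort_satisfies_group_by` are hypothetical, and the test only considers prefixes of the sort columns. A sort is taken to satisfy a GROUP BY when some prefix of the sort columns both determines and is determined by the grouping columns, given the known equivalences and functional dependencies.

```python
def closure(cols, fds, equivalences):
    """All columns determined by cols under the given FDs and equivalences."""
    known = set(cols)
    changed = True
    while changed:
        changed = False
        for a, b in equivalences:
            for x, y in ((a, b), (b, a)):
                if x in known and y not in known:
                    known.add(y)
                    changed = True
        for lhs, rhs in fds:
            if set(lhs) <= known and not set(rhs) <= known:
                known |= set(rhs)
                changed = True
    return known


def sort_satisfies_group_by(sort_cols, group_cols, fds=(), equivalences=()):
    """True if some prefix of sort_cols keeps every group contiguous."""
    group_closure = closure(group_cols, fds, equivalences)
    for n in range(1, len(sort_cols) + 1):
        prefix = sort_cols[:n]
        if not set(prefix) <= group_closure:
            return False  # this and every longer prefix splits a group
        if set(group_cols) <= closure(prefix, fds, equivalences):
            return True
    return False


# Properties of Query 3 as described in the text.
fds = [({"o_orderkey"}, {"o_orderdate", "o_shippriority"})]
equivalences = [("o_orderkey", "l_orderkey")]
group_by = ["l_orderkey", "o_orderdate", "o_shippriority"]

with_properties = sort_satisfies_group_by(["o_orderkey"], group_by, fds, equivalences)
without_properties = sort_satisfies_group_by(["o_orderkey"], group_by)
```

With both the equivalence class and the FD available, the sort on o_orderkey is accepted; without them, the same sort is rejected, which mirrors the behavior reported for the version of DB2 with order optimization disabled.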

For comparison, the execution plan chosen by the ver- 
sion of DB2 with order optimization disabled is shown 
in Figure 8. In this case, the DB2 optimizer was 
unable to detect that the sort on o_orderkey satisfies the 
GROUP BY. Moreover, without an awareness of 
equivalence classes, the optimizer was unable to determine 
that the same sort could be used to generate an ordered 
nested-loop join for the lineitem table. Consequently, a 
more costly merge-join was used. 

Figure 8: Query 3 with Order Optimization Disabled 

9 Conclusion 

This paper described the novel techniques that are used 
for order optimization in the query optimizer of IBM's 
DB2. These general techniques, which can be used by 
any query optimizer, make it possible to determine when 
sorting can be avoided because of predicates, keys, 
indexes, or functional dependencies; the minimal number 
of sorting columns when a sort is unavoidable; whether 
a sort can be pushed down into a view or join tree to 
make it cheaper; and whether two or more sorts can be 
combined and satisfied by a single sort. For complex 
queries in a data warehouse environment, these 
techniques can mean the difference between an execution 
plan that finishes in a few minutes and one that takes 
hours to run. 

This paper's main contribution was a set of funda- 
mental operations for use in order optimization. Al- 
gorithms were provided for testing whether an inter- 
esting order is satisfied, for combining two interesting 
orders, and for pushing down an interesting order in 
a query graph. All of these hinge on a core operation 
called Reduce Order, which uses functional dependencies 
and predicates to reduce interesting orders to a simple 
canonical form. 

This paper also described the overall architecture of 
the DB2 optimizer for order optimization. In particular, 
the paper described how order, predicates, keys, and 
functional dependencies can be maintained as access 
plan properties. The importance of maintaining 
functional dependencies as a property goes beyond just order 
optimization. Functional dependencies can be used for 
other optimizations as well [DD92]. 

Finally, results for Query 3 of the TPC-D benchmark 
were provided to illustrate how much the techniques de- 
scribed in this paper can improve performance. On a 
1GB TPC-D database, a version of DB2 with order op- 
timization disabled ran Query 3 roughly 2x slower than 
the production version of DB2 with order optimization 
enabled. 